ABSTRACT: We present new M-estimators of the mean and variance of real valued 
random variables, based on PAC-Bayes bounds. We analyze the non-asymptotic minimax 
properties of the deviations of those estimators for sample distributions having either a 
bounded variance or a bounded variance and a bounded kurtosis. Under those weak hy- 
potheses, allowing for heavy-tailed distributions, we show that the worst case deviations 
of the empirical mean are suboptimal. We prove indeed that for any confidence level, 
there is some M-estimator whose deviations are of the same order as the deviations of 
the empirical mean of a Gaussian statistical sample, even when the statistical sample is 
instead heavy-tailed. Experiments reveal that these new estimators perform even better 
than predicted by our bounds, showing deviation quantile functions uniformly lower at 
all probability levels than the empirical mean for non-Gaussian sample distributions as 
simple as the mixture of two Gaussian measures. 

1. Introduction 

This paper is devoted to the estimation of the mean and possibly also of the 
variance of a real random variable from an independent identically distributed 
sample. While the most traditional way to deal with this question is to focus on 
the mean square error of estimators, we will instead focus on their deviations. De- 
viations are related to the estimation of confidence intervals which are of impor- 
tance in many situations. While the empirical mean has an optimal minimax mean 
square error among all mean estimators in all models including Gaussian distribu- 
tions, its deviations tell a different story. Indeed, as far as the mean square error is 
concerned, Gaussian distributions represent already the worst case, so that in the 
framework of a minimax mean least square analysis, no need is felt to improve 
estimators for non-Gaussian sample distributions. On the contrary, the deviations 
of estimators, and especially of the empirical mean, are worse for non-Gaussian 
samples than for Gaussian ones. Thus a deviation analysis will point out possible 
improvements of the empirical mean estimator more successfully. It was nonethe- 
less quite unexpected for us, and will undoubtedly also be for some of our readers, 
that the empirical mean could be improved, under such a weak hypothesis as the 
existence of a finite variance, and that this has remained unnoticed until now. One 
of the reasons may be that the weaknesses of the empirical mean disappear if we 
let the sample distribution be fixed and the sample size go to infinity. This does 
not mean however that a substantial improvement is not possible, nor that it is 
only concerned with specific sample sizes or weird worst case distributions : in 
the end of this paper, we will present experiments made on quite simple sample 
distributions, consisting in the mixture of two to three Gaussian measures, show- 
ing that more than twenty five percent can be gained on the widths of confidence 
intervals, for realistic sample sizes ranging from 100 to 2000. We think that, be- 
yond the technicalities involved here, this exemplifies more broadly the pitfalls of 
asymptotic studies in statistics and should be quite thought provoking about the 
notions of optimality commonly used to assess the performance of estimators. 

Our deviation study will use two kinds of tools: M-estimators to truncate ob- 
servations and PAC-Bayesian theorems to combine estimates on the same sample 
without using a split scheme [13, 12, 14, 8, 2, 1]. 

Its general conclusion is that, whereas the deviations of the empirical mean es- 
timate may increase a lot when the sample distribution is far from being Gaussian, 
those of some new M-estimators will not. The improvement is the best for heavy- 
tailed distributions, as the worst case analysis performed to prove lower bounds 
will show. The improvement also increases as the confidence level at which devi- 
ations are computed increases. 

Similar conclusions can be drawn in the case of least square regression with 
random design (Hill. Discovering that using truncated estimators permits to get 
rid of sub Gaussian tail assumptions was the spur to study the simpler case of 
mean estimation for its own sake. Restricting the subject in this way (which is of 
course a huge restriction compared with least square regression) makes it possible 
to propose simpler dedicated estimators and to push their analysis further. It will 
indeed be possible here to obtain mathematical proofs for numerically significant 
non-asymptotic bounds. 

The weakest hypothesis we will consider is the existence of a finite but un- 
known variance. In our M-estimators, adapting the truncation level depends on 
the value of the variance. However, this adaptation can be done without actually 
knowing the variance, through Lepski's approach. 

Computing an observable confidence interval, on the other hand, requires 
more information. The simplest case is when the variance is known, or at least 
lower than a known bound. If it is not so, another possibility is to assume that the 
kurtosis is known, or lower than a known bound. Introducing the kurtosis is natu- 
ral to our approach: in order to calibrate the truncation level for the estimation of 
the mean, we need to know the variance, and in the same way, in order to calibrate 
the truncation level for the estimation of the variance, we need to use the variance 
of the variance, which is provided by the kurtosis as a function of the variance 
itself. 

In order to assess the quality of the results, we prove corresponding lower 
bounds for the best estimator confronted to the worst possible sample distribu- 
tion, following the minimax approach. We also compute lower bounds for the 
deviations of the empirical mean estimator when the sample distribution is the 
worst possible. These latter bounds show the improvement that can be brought 
by M-estimators over the more traditional empirical mean. We plot the numerical 
values of these upper and lower bounds against the confidence level for typical 
finite sample sizes to show the gap between them. 

The reader may wonder why we only consider the following extreme models, 
the narrow Gaussian model and the broad models 

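$$\mathcal{A}_{v_{\max}} = \bigl\{ P \in \mathcal{M}_+^1(\mathbb{R}) : \operatorname{Var}_P \le v_{\max} \bigr\}, \tag{1.1}$$

$$\text{and} \quad \mathcal{B}_{v_{\max}, \kappa_{\max}} = \bigl\{ P \in \mathcal{A}_{v_{\max}} : \kappa_P \le \kappa_{\max} \bigr\}, \tag{1.2}$$

where $\operatorname{Var}_P$ is the variance of $P$, $\kappa_P$ its kurtosis, and $\mathcal{M}_+^1(\mathbb{R})$ is the set of 
probability measures (positive measures of mass 1) on the real line equipped with the 
Borel sigma-algebra. 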

The reason is that, the minimax bounds obtained in these broad models being 
close to the ones obtained in the Gaussian model, introducing intermediate models 
would not change the order of magnitude of the bounds. 

Let us end this introduction by advocating the value of confidence bounds, 
stressing more particularly the case of high confidence levels, since this is the 
situation where truncation brings the most decisive improvements. 

One situation of interest which we will not comment further is when the es- 
timated parameter is critical and making a big mistake on its value, even with a 
small probability, is unacceptable. 

Another scenario to be encountered in statistical learning is the case when lots 
of estimates are to be computed and compared in the course of some decision 
making. Let us imagine, for instance, that some parameter $\theta \in \Theta$ is to be tuned 
in order to optimize the expected loss $\mathbb{E}[f_\theta(X)]$ of some family of loss functions 
$\{f_\theta : \theta \in \Theta\}$ computed on some random input $X$. Let us consider a split sample 
scheme where two i.i.d. samples $(X_1, \dots, X_s) = X_1^s$ and $(X_{s+1}, \dots, X_{s+n}) = X_{s+1}^{s+n}$ 
are used, one to build some estimators $\hat\theta_k(X_1^s)$ of $\arg\min_{\theta \in \Theta_k} \mathbb{E}[f_\theta(X)]$ in 
subsets $\Theta_k$, $k = 1, \dots, K$ of $\Theta$, and the other to test those estimators and keep 
hopefully the best. This is a very common model selection situation. One can 
think for instance of the choice of a basis to expand some regression function. If 
$K$ is large, estimates of $\mathbb{E}\bigl[f_{\hat\theta_k(X_1^s)}(X_{s+1})\bigr]$ will be required for a lot of values of $k$. 
In order to keep safe from over-fitting, very high confidence levels will be required 
if the resulting confidence level is to be computed through a union bound (because 
no special structure of the problem can be used to do better). Namely, a confidence 
level of $1 - \epsilon$ on the final result of the optimization on the test sample will require 
a confidence level of $1 - \epsilon/K$ for each mean estimate on the test sample. Even 
if $\epsilon$ is not very small (like, say, 5/100), $\epsilon/K$ may be very small. For instance, 
if 10 parameters are to be selected among a set of 100, this gives $K = \binom{100}{10} \simeq 1.7 \cdot 10^{13}$. 
In practice, except in some special situations where fast algorithms exist, 
a heuristic scheme will be used to compute only a limited number of estimators 
$\hat\theta_k$. An example of heuristics is to add greedily parameters one at a time, choosing 
at each step the one with the best estimated performance increase (in our example, 
this requires to compute 1000 estimators instead of $\binom{100}{10}$). Nonetheless, asserting 
the quality of the resulting choice requires a union bound on the whole set of 
possible outcomes of the data driven heuristic, and therefore calls for very high 
confidence levels for each estimate of the mean performance $\mathbb{E}\bigl[f_{\hat\theta_k(X_1^s)}(X_{s+1})\bigr]$ 
on the test set. 
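
To make the orders of magnitude concrete, the arithmetic of this example can be sketched as follows. The function name `per_estimate_risk` is ours, and the greedy count assumes one reading of the heuristic (each step evaluates every parameter not yet selected), which gives a figure close to the round number 1000 quoted above.

```python
import math

def per_estimate_risk(eps: float, n_candidates: int) -> float:
    """Risk level each single estimate may use so that a union bound
    over n_candidates estimates keeps the overall risk below eps."""
    return eps / n_candidates

# Number of ways to select 10 parameters among 100.
K = math.comb(100, 10)

# Overall risk level of 5/100, as in the text.
eps = 0.05
eps_each = per_estimate_risk(eps, K)

# Assumed greedy scheme: step t evaluates the 100 - t parameters not yet chosen.
greedy_evaluations = sum(100 - step for step in range(10))

print(f"K = {K} (about {K:.2e})")
print(f"risk allowed per estimate = {eps_each:.2e}")
print(f"greedy evaluations = {greedy_evaluations}")
```

The union bound has to cover all $K$ possible outcomes even though only about a thousand estimators are actually computed, which is why the per-estimate risk level is so small.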

The question we are studying in this paper should not be confused with robust 
statistics [Hi. The most fundamental difference is that we are interested in 
estimating the mean of the sample distribution. In robust statistics, it is assumed 
that the sample distribution is in the neighbourhood of some known parametric 
model. This gives the possibility to replace the mean by some other location 
parameter, which as a rule will not be equal to the mean when the shape of the 
distribution is not constrained (and in particular is not assumed to be symmetric). 

Other differences are that our point of view is non-asymptotic and that we 
study the deviations of estimators whereas robust statistics is focussed on their 
asymptotic mean square error. 

Although we end up defining M-estimators with the help of influence func- 
tions, like in robust statistics, we use a truncation level depending on the sample 
size, whereas in robust statistics, the truncation level depends on the amount of 
contamination. Also, we truncate at much higher levels (that is we eliminate fewer 
outliers) than what would be advisable for instance in the case of a contaminated 
Gaussian statistical sample. Thus, although we have some tools in common with 
robust statistics, we use them differently to achieve a different purpose. 

Adaptive estimation of a location parameter [15, 5, 6] is another setting where 
the empirical mean can be replaced by more efficient estimators. However, the 
setting studied by these authors is quite different from ours. The main difference, 
here again, is that the location parameter is assumed to be the center of symmetry 
of the sample distribution, a fact that is used to tailor location estimators based 
on a symmetrized density estimator. Another difference is that in these papers, 
the estimators are built with asymptotic properties in view, such as asymptotic 
normality with optimal variance and asymptotic robustness. These properties, 
although desirable, give no information on the non-asymptotic deviations we are 
studying here, and therefore do not provide as we do non-asymptotic confidence 
intervals. 

2. Some new M-estimators 



Let $(Y_i)_{i=1}^n$ be an i.i.d. sample drawn from some unknown probability distribution 
$P$ on the real line $\mathbb{R}$ equipped with the Borel $\sigma$-algebra $\mathcal{B}$. Let $Y$ be 
independent from $(Y_i)_{i=1}^n$ with the same marginal distribution $P$. Assuming that 
$Y \in L^2(P)$, let $m$ be the mean of $Y$ and let $v$ be its variance: 

$$\mathbb{E}(Y) = m \quad \text{and} \quad \mathbb{E}\bigl[(Y - m)^2\bigr] = v.$$



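Let us consider some non-decreasing influence function $\psi : \mathbb{R} \to \mathbb{R}$ such that 

$$-\log\bigl(1 - x + x^2/2\bigr) \le \psi(x) \le \log\bigl(1 + x + x^2/2\bigr). \tag{2.1}$$

The widest possible choice of $\psi$ compatible with these inequalities is 

$$\psi(x) = \begin{cases} \log\bigl(1 + x + x^2/2\bigr), & x \ge 0, \\ -\log\bigl(1 - x + x^2/2\bigr), & x \le 0, \end{cases} \tag{2.2}$$

whereas the narrowest possible choice is 

$$\psi(x) = \begin{cases} \log(2), & x \ge 1, \\ -\log\bigl(1 - x + x^2/2\bigr), & 0 \le x \le 1, \\ \log\bigl(1 + x + x^2/2\bigr), & -1 \le x \le 0, \\ -\log(2), & x \le -1. \end{cases} \tag{2.3}$$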



Although $\psi$ is not the derivative of some explicit error function, we will use it in 
the same way, so that it can be considered as an influence function. 

Indeed, $\alpha$ being some positive real parameter to be chosen later, we will build 
our estimator $\hat\theta_\alpha$ of the mean $m$ as the solution of the equation 

$$\sum_{i=1}^n \psi\bigl[\alpha (Y_i - \hat\theta_\alpha)\bigr] = 0.$$

(When the narrow choice of $\psi$ defined by equation (2.3) is made, the above equation 
may have more than one solution, in which case any of them can be used to 
define $\hat\theta_\alpha$.) 



2 We would like to thank one of the anonymous referees of an early version of this paper for 
pointing out the benefits that could be drawn from the use of a non-decreasing influence function 
in this section. 

[Figure: plot of $x \mapsto \psi(x)$, showing the identity map $x \mapsto x$, the narrow influence 
function and the wide influence function.] 

The widest choice of $\psi$ is the one making $\hat\theta_\alpha$ the closest to the empirical mean. 
For this reason it may be preferred if our aim is to stabilize the empirical mean by 
making the smallest possible change, which could be justified by the fact that the 
empirical mean is optimal in the case when the sample $(Y_i)_{i=1}^n$ is Gaussian. 
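
The two extreme influence functions can be written down directly from the inequalities they have to satisfy. The following is a minimal sketch; the names `psi_wide`, `psi_narrow` and `within_bounds` are ours, and the tolerance used in the check is an arbitrary choice.

```python
import math

def psi_wide(x: float) -> float:
    """Widest influence function: equals the upper bound for x >= 0
    and the lower bound for x <= 0."""
    if x >= 0:
        return math.log(1 + x + x * x / 2)
    return -math.log(1 - x + x * x / 2)

def psi_narrow(x: float) -> float:
    """Narrowest non-decreasing influence function: follows the lower bound
    on [0, 1], the upper bound on [-1, 0], and is constant outside [-1, 1]."""
    if x >= 1:
        return math.log(2)
    if x >= 0:
        return -math.log(1 - x + x * x / 2)
    if x >= -1:
        return math.log(1 + x + x * x / 2)
    return -math.log(2)

def within_bounds(psi, x: float, tol: float = 1e-12) -> bool:
    """Check -log(1 - x + x^2/2) <= psi(x) <= log(1 + x + x^2/2) at the point x."""
    lower = -math.log(1 - x + x * x / 2)
    upper = math.log(1 + x + x * x / 2)
    return lower - tol <= psi(x) <= upper + tol

grid = [k / 100 for k in range(-500, 501)]
print(all(within_bounds(psi_wide, x) for x in grid))
print(all(within_bounds(psi_narrow, x) for x in grid))
```

Both functions agree with the identity to second order near the origin; they differ in how strongly large observations are damped, the narrow choice saturating at $\pm\log(2)$.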

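Our analysis of $\hat\theta_\alpha$ will rely on the following exponential moment inequalities, 
from which deviation bounds will follow. Let us introduce the quantity 

$$r(\theta) = \frac{1}{\alpha n} \sum_{i=1}^n \psi\bigl[\alpha (Y_i - \theta)\bigr], \qquad \theta \in \mathbb{R}.$$

Proposition 2.1 

$$\mathbb{E}\bigl\{\exp[\alpha n\, r(\theta)]\bigr\} \le \Bigl\{1 + \alpha (m - \theta) + \frac{\alpha^2}{2}\bigl[v + (m - \theta)^2\bigr]\Bigr\}^n \le \exp\Bigl\{n\alpha (m - \theta) + \frac{n\alpha^2}{2}\bigl[v + (m - \theta)^2\bigr]\Bigr\}.$$

In the same way 

$$\mathbb{E}\bigl\{\exp[-\alpha n\, r(\theta)]\bigr\} \le \Bigl\{1 - \alpha (m - \theta) + \frac{\alpha^2}{2}\bigl[v + (m - \theta)^2\bigr]\Bigr\}^n \le \exp\Bigl\{-n\alpha (m - \theta) + \frac{n\alpha^2}{2}\bigl[v + (m - \theta)^2\bigr]\Bigr\}.$$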



The proof of this proposition is an obvious consequence of inequalities (2.1) and 
of the fact that the sample is assumed to be i.i.d. It justifies the special choice of 
influence function we made. If we had taken for $\psi$ the identity function, and thus 
for $\hat\theta_\alpha$ the empirical mean, the exponential moments of $r(\theta)$ would have existed 
only in the case when the random variable $Y$ itself has exponential moments. In 
order to bound $\hat\theta_\alpha$, we will find two non-random values $\theta_-$ and $\theta_+$ of the parameter 
such that with large probability $r(\theta_-) > 0 > r(\theta_+)$, which will imply that $\theta_- < 
\hat\theta_\alpha < \theta_+$, since $r(\hat\theta_\alpha) = 0$ by construction and $\theta \mapsto r(\theta)$ is non-increasing. 

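Proposition 2.2 The values of the parameters $\alpha \in \mathbb{R}_+$ and $\epsilon \in (0, 1)$ being set, 
let us define for any $\theta \in \mathbb{R}$ the bounds 

$$B_+(\theta) = m - \theta + \frac{\alpha}{2}\bigl[v + (m - \theta)^2\bigr] + \frac{\log(\epsilon^{-1})}{n\alpha},$$

$$B_-(\theta) = m - \theta - \frac{\alpha}{2}\bigl[v + (m - \theta)^2\bigr] - \frac{\log(\epsilon^{-1})}{n\alpha}.$$

They satisfy 

$$\mathbb{P}\bigl[r(\theta) < B_+(\theta)\bigr] \ge 1 - \epsilon, \qquad \mathbb{P}\bigl[r(\theta) > B_-(\theta)\bigr] \ge 1 - \epsilon.$$

The proof of this proposition is also straightforward: it is a mere consequence of 
Chebyshev's inequality and of the previous proposition. Let us assume that 

$$\alpha^2 v + \frac{2 \log(\epsilon^{-1})}{n} \le 1.$$

Let $\theta_+$ be the smallest solution of the quadratic equation $B_+(\theta_+) = 0$ and let 
$\theta_-$ be the largest solution of the equation $B_-(\theta_-) = 0$. 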

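Lemma 2.3 

$$\theta_+ = m + \Bigl(\frac{\alpha v}{2} + \frac{\log(\epsilon^{-1})}{\alpha n}\Bigr) \Bigl(\frac{1}{2} + \frac{1}{2}\sqrt{1 - \alpha^2 v - \frac{2\log(\epsilon^{-1})}{n}}\Bigr)^{-1} \le m + \Bigl(\frac{\alpha v}{2} + \frac{\log(\epsilon^{-1})}{\alpha n}\Bigr) \Bigl(1 - \frac{\alpha^2 v}{2} - \frac{\log(\epsilon^{-1})}{n}\Bigr)^{-1},$$

$$\theta_- = m - \Bigl(\frac{\alpha v}{2} + \frac{\log(\epsilon^{-1})}{\alpha n}\Bigr) \Bigl(\frac{1}{2} + \frac{1}{2}\sqrt{1 - \alpha^2 v - \frac{2\log(\epsilon^{-1})}{n}}\Bigr)^{-1} \ge m - \Bigl(\frac{\alpha v}{2} + \frac{\log(\epsilon^{-1})}{\alpha n}\Bigr) \Bigl(1 - \frac{\alpha^2 v}{2} - \frac{\log(\epsilon^{-1})}{n}\Bigr)^{-1}.$$

Moreover, since the map $\theta \mapsto r(\theta)$ is non-increasing, with probability at least 
$1 - 2\epsilon$, 

$$\theta_- < \hat\theta_\alpha < \theta_+.$$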

The proof of this lemma is also an obvious consequence of the previous proposition 
and of the definitions of $\theta_+$ and $\theta_-$. Optimizing the choice of $\alpha$ provides 



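PROPOSITION 2.4 Let us assume that $n > 2 \log(\epsilon^{-1})$ and let us consider 

$$\eta = \sqrt{\frac{2 v \log(\epsilon^{-1})}{n \Bigl(1 - \dfrac{2\log(\epsilon^{-1})}{n}\Bigr)}} \qquad \text{and} \qquad \alpha = \sqrt{\frac{2 \log(\epsilon^{-1})}{n (v + \eta^2)}}.$$

In this case $\theta_+ = m + \eta$ and $\theta_- = m - \eta$, so that with probability at least $1 - 2\epsilon$, 

$$|m - \hat\theta_\alpha| \le \eta = \sqrt{\frac{2 v \log(\epsilon^{-1})}{n \Bigl(1 - \dfrac{2\log(\epsilon^{-1})}{n}\Bigr)}}.$$

In the same way, if we want to make a choice of $\alpha$ independent from $\epsilon$, we can 
choose 

$$\alpha = \sqrt{\frac{2}{n v}}$$

and assume that 

$$n > 2\bigl[1 + \log(\epsilon^{-1})\bigr].$$

In this latter case, with probability at least $1 - 2\epsilon$, 

$$|\hat\theta_\alpha - m| \le \frac{1 + \log(\epsilon^{-1})}{\dfrac{1}{2} + \dfrac{1}{2}\sqrt{1 - \dfrac{2[1 + \log(\epsilon^{-1})]}{n}}} \sqrt{\frac{v}{2n}} \le \frac{1 + \log(\epsilon^{-1})}{1 - \dfrac{1 + \log(\epsilon^{-1})}{n}} \sqrt{\frac{v}{2n}}.$$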
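
The construction of $\hat\theta_\alpha$ and of the bound $\eta$ of Proposition 2.4 can be sketched numerically as follows. This is only an illustration under assumptions of ours: the wide influence function is used, the root of the estimating equation is located by bisection (any root finder would do, since the score is non-increasing in $\theta$), and the names `psi_wide`, `score` and `catoni_mean` are not taken from the paper.

```python
import math
import random

def psi_wide(x: float) -> float:
    """Wide influence function of equation (2.2)."""
    if x >= 0:
        return math.log(1 + x + x * x / 2)
    return -math.log(1 - x + x * x / 2)

def score(sample, alpha: float, theta: float) -> float:
    """Sum of psi[alpha (Y_i - theta)]; non-increasing in theta."""
    return sum(psi_wide(alpha * (y - theta)) for y in sample)

def catoni_mean(sample, v: float, eps: float):
    """Return (theta_hat, eta, alpha) with alpha and eta chosen as in Proposition 2.4,
    v being the variance (or an upper bound) and 1 - 2 eps the confidence level."""
    n = len(sample)
    log_term = math.log(1.0 / eps)
    if n <= 2 * log_term:
        raise ValueError("Proposition 2.4 requires n > 2 log(1/eps)")
    eta = math.sqrt(2 * v * log_term / (n * (1 - 2 * log_term / n)))
    alpha = math.sqrt(2 * log_term / (n * (v + eta ** 2)))
    # The score is >= 0 at min(sample) and <= 0 at max(sample), so a root lies between.
    lo, hi = min(sample), max(sample)
    for _ in range(200):
        mid = (lo + hi) / 2
        if score(sample, alpha, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2, eta, alpha

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(500)]
theta_hat, eta, alpha = catoni_mean(sample, v=1.0, eps=0.01)
print(theta_hat, eta, alpha)
```

On a Gaussian sample such as this one, $\alpha (Y_i - \theta)$ stays small, $\psi$ is close to the identity, and the estimate is close to the empirical mean; the two estimators separate when the sample contains very large observations.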



In the following plots, we compare the bounds on the deviations of our M-estimator 
$\hat\theta_\alpha$ with the deviations of the empirical mean $M = \frac{1}{n} \sum_{i=1}^n Y_i$ when the 
sample distribution is Gaussian and when it belongs to the model $\mathcal{A}_1$ defined in 
the introduction by equation (1.1). (The bounds for the empirical mean 
will be explained and proved in subsequent sections.) 

More precisely, the deviation upper bounds for our estimator $\hat\theta_\alpha$ for the worst 
sample distribution in $\mathcal{A}_1$, the model defined by (1.1), are compared with 
the exact deviations of the empirical mean $M$ of a Gaussian sample. This is the 
minimax bound at all confidence levels in the Gaussian model, as will be proved 
later on. Consequently, the deviations of our estimator cannot be smaller for the 
worst sample distribution in $\mathcal{A}_1$, which contains Gaussian distributions. We see on 
the plots that starting from $\epsilon = 0.1$ (corresponding to an 80% confidence level), our 
upper bound is close to being minimax, not only in $\mathcal{A}_1$, but also in the small Gaussian 
sub-model. This shows that the deviations of our estimator are close to reaching 
the minimax bound in any intermediate model containing Gaussian distributions 
and contained in $\mathcal{A}_1$. 

Our estimator is also compared with the deviations of the empirical mean for 
the worst distribution in $\mathcal{A}_1$ (to be established later). In particular the lower bound 
proves that there are sample distributions in $\mathcal{A}_1$ for which the deviations of the 
empirical mean are far from being optimal, showing the need to introduce a new 
estimator to correct this. 

In the first plot, we chose a sample size $n = 100$ and plotted the deviations 
against the confidence level (or rather against $\epsilon$, the confidence level being $1 - 2\epsilon$). 

[Figure: $n = 100$, $v = 1$. Deviations plotted against $\epsilon$, starting from 0.5, the confidence 
level being $1 - 2\epsilon$. Curves: $\hat\theta_\alpha$ upper bound, worst sample in $\mathcal{A}_1$; $M$ upper bound, 
Gaussian sample; $M$ upper bound, worst sample in $\mathcal{A}_1$.] 

As shown on the second plot, showing a wider range of $\epsilon$ values, our bound 
stays close to the Gaussian bound up to very high confidence levels (up to $\epsilon = 
10^{-9}$ and more). On the other hand, it already outperforms the empirical mean by 
a factor larger than two at confidence level 98% (that is for $\epsilon = 10^{-2}$). 

When we increase the sample size to $n = 500$, the performance of our M-estimator 
is even closer to optimal. 

[Figure: $n = 500$, $v = 1$. Deviations plotted against $\epsilon$, starting from 0.5, the confidence 
level being $1 - 2\epsilon$.] 

3. Adapting to an unknown variance 



In this section, we will use Lepski's renowned adaptation method [11] when 
nothing is known, except that the variance is finite. Under such uncertain (but 
unfortunately so frequent) circumstances, it is impossible to provide any observable 
confidence intervals, but it is still possible to define an adaptive estimator and to 
bound its deviations by unobservable bounds (depending on the unknown vari- 
ance). To understand the subject of this section, one should keep in mind that 
adapting to the variance is a weaker requirement than estimating the variance : 
estimating the variance at any predictable rate would require more prior informa- 
tion (bearing for instance on higher moments of the sample distribution). 

The idea of Lepski's method is powerful and simple: consider a sequence 
of confidence intervals obtained by assuming that the variance is bounded by a 
sequence of bounds $v_k$ and pick as an estimator the middle of the smallest 
interval intersecting all the larger ones. For this to be legitimate, we need all the 
confidence regions for which the variance bound is valid to hold together, which 
is achieved using a union bound. 
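
As an illustration of this selection rule, here is a minimal sketch operating on a finite list of confidence intervals, ordered by increasing assumed variance bound. It implements the running-intersection variant that is made precise in the remainder of this section (intersect the intervals from the largest bound downwards and stop before the intersection becomes empty); the function name `lepski_middle` and the toy intervals are ours.

```python
def lepski_middle(intervals):
    """intervals: list of (lower, upper) pairs, ordered by increasing variance bound,
    so that later intervals are the more conservative ones.
    Returns the middle of the last non-empty running intersection."""
    if not intervals:
        raise ValueError("at least one confidence interval is required")
    lo, hi = float("-inf"), float("inf")
    for lower, upper in reversed(intervals):
        new_lo, new_hi = max(lo, lower), min(hi, upper)
        if new_lo > new_hi:
            # This smaller variance bound contradicts the larger ones: stop here.
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2

# Nested intervals: every variance bound is compatible with the others.
print(lepski_middle([(0.9, 1.1), (0.5, 1.6), (0.0, 2.0), (-1.0, 3.0)]))

# The tightest interval is incompatible with the larger ones and is discarded.
print(lepski_middle([(5.0, 5.2), (0.5, 1.5), (0.0, 2.0)]))
```

The rule never needs the true variance: an interval built under too small a variance bound is either consistent with the safer ones, in which case it sharpens the estimate, or it is rejected.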

Let us describe this idea more precisely. Let $\hat\theta(v_{\max})$ be some estimator of 
the mean depending on some assumed variance bound $v_{\max}$, as the ones described 
in the beginning of this paper. Let $\delta(v_{\max}, \epsilon) \in \mathbb{R}_+ \cup \{+\infty\}$ be some deviation 
bound$^3$ proved in $\mathcal{A}_{v_{\max}}$: namely let us assume that for any sample distribution in 
$\mathcal{A}_{v_{\max}}$, with probability at least $1 - 2\epsilon$, 

$$|m - \hat\theta(v_{\max})| \le \delta(v_{\max}, \epsilon).$$

Let us also decide by convention that $\delta(v_{\max}, 0) = +\infty$. 

Let $\nu \in \mathcal{M}_+^1(\mathbb{R}_+)$ be some coding atomic probability measure on the positive 
real line, which will serve to take a union bound on a (countable) set of possible 
values of $v_{\max}$. 

We can choose for instance for $\nu$ the following coding distribution: expressing 
$v_{\max}$ by comparison with some reference value $V$, 

$$v_{\max} = V 2^s \sum_{k=0}^d c_k 2^{-k}, \qquad s \in \mathbb{Z},\ d \in \mathbb{N},\ (c_k)_{k=0}^d \in \{0, 1\}^{d+1},\ c_0 = c_d = 1,$$

we set 

$$\nu(v_{\max}) = \frac{2^{-2(d-1)}}{5 (|s| + 2)(|s| + 3)},$$

and otherwise we set $\nu(v_{\max}) = 0$. It is easy to see that this defines a probability 
distribution on $\mathbb{R}_+$ (supported by dyadic numbers scaled by the factor $V$). It is 
clear that, as far as possible, the reference value $V$ should be chosen as close as 
possible to the true variance $v$. 

3 $\mathcal{A}_{v_{\max}}$ is defined by equation (1.1). 

Another possibility is to set for some parameters $V \in \mathbb{R}$, $\rho > 1$ and $s \in \mathbb{N}$, 

$$\nu(V \rho^{2k}) = \frac{1}{2s + 1}, \qquad k \in \mathbb{Z},\ |k| \le s. \tag{3.1}$$
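
That the dyadic coding weights $2^{-2(d-1)} / [5(|s|+2)(|s|+3)]$ add up to one can be checked in exact arithmetic on truncated sums. The sketch below assumes this reading of the weights and counts, for each $d$, the digit strings $(c_0, \dots, c_d)$ with $c_0 = c_d = 1$; the function name `truncated_mass` is ours.

```python
from fractions import Fraction

def truncated_mass(s_max: int, d_max: int) -> Fraction:
    """Total weight of the codes with |s| <= s_max and d <= d_max."""
    scale_part = sum(
        Fraction(1, (abs(s) + 2) * (abs(s) + 3)) for s in range(-s_max, s_max + 1)
    )

    def n_codes(d: int) -> int:
        # Digit strings in {0,1}^(d+1) whose first and last digits are both 1.
        return 1 if d == 0 else 2 ** (d - 1)

    # 2^(-2(d-1)) is written as 4 / 4^d.
    digit_part = sum(n_codes(d) * Fraction(4, 4 ** d) for d in range(d_max + 1))
    return scale_part * digit_part / 5

mass = truncated_mass(1000, 40)
print(float(mass))
```

The scale part telescopes to $5/6$ and the digit part converges to $6$, so the truncated mass increases to $1$ as both cut-offs grow.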

Let us consider for any $v_{\max}$ such that $\delta(v_{\max}, \epsilon\nu(v_{\max})) < +\infty$ the confidence 
interval 

$$I(v_{\max}) = \hat\theta(v_{\max}) + \delta\bigl[v_{\max}, \epsilon\nu(v_{\max})\bigr] \times (-1, 1).$$

Let us put $I(v_{\max}) = \mathbb{R}$ when $\delta(v_{\max}, \epsilon\nu(v_{\max})) = +\infty$. 

Let us consider the non-decreasing family of closed intervals 

$$J(v_1) = \bigcap \bigl\{ I(v_{\max}) : v_{\max} \ge v_1 \bigr\}, \qquad v_1 \in \mathbb{R}_+.$$

(In this definition, we can restrict the intersection to the support of $\nu$, since otherwise 
$I(v_{\max}) = \mathbb{R}$.) A union bound shows immediately that with probability at 
least $1 - 2\epsilon$, $m \in J(v)$, implying as a consequence that $J(v) \ne \emptyset$. 

Proposition 3.1 Since $v_1 \mapsto J(v_1)$ is a non-decreasing family of closed intervals, 
the intersection 

$$\bigcap \bigl\{ J(v_1) : v_1 \in \mathbb{R}_+,\ J(v_1) \ne \emptyset \bigr\}$$

is a non-empty closed interval, and we can therefore pick up an adaptive estimator 
$\hat\theta$ belonging to it, choosing for instance the middle of this interval. 

With probability at least $1 - 2\epsilon$, $m \in J(v)$, which implies that $J(v) \ne \emptyset$, and 
therefore that $\hat\theta \in J(v)$. 

Thus with probability at least $1 - 2\epsilon$ 

$$|m - \hat\theta| \le |J(v)| \le 2 \inf_{v_{\max} \ge v} \delta\bigl(v_{\max}, \epsilon\nu(v_{\max})\bigr).$$

If the confidence bound $\delta(v_{\max}, \epsilon)$ is homogeneous, in the sense that 

$$\delta(v_{\max}, \epsilon) = B(\epsilon) \sqrt{v_{\max}},$$

as it is the case in Proposition 2.4 with 

$$B(\epsilon) = \sqrt{\frac{2 \log(\epsilon^{-1})}{n \Bigl(1 - \dfrac{2\log(\epsilon^{-1})}{n}\Bigr)}},$$

then with probability at least $1 - 2\epsilon$, 

$$|m - \hat\theta| \le 2 \inf_{v_{\max} \ge v} B\bigl[\epsilon\nu(v_{\max})\bigr] \sqrt{v_{\max}}.$$

Thus in the case when $\nu$ is defined by equation (3.1) and $|\log(v/V)| \le 
2s \log(\rho)$, with probability at least $1 - 2\epsilon$ 

$$|m - \hat\theta| \le 2\rho\, B\Bigl(\frac{\epsilon}{2s + 1}\Bigr) \sqrt{v}.$$



Let us see what happens for a sample size of $n = 500$, when we assume that 
$|\log(v/V)| \le 2\log(100)$ and we take $\rho = 1.05$. 

[Figure: $n = 500$, $v = 1$. Deviations plotted against $\epsilon$, starting from 0.5, the confidence 
level being $1 - 2\epsilon$.] 

This plot shows that, for a sample of size $n = 500$, there are sample distributions 
with a finite variance for which the deviations of the empirical mean blow 
up for confidence levels higher than 99%, whereas the deviations of our adaptive 
estimator remain under control, even at confidence levels as high as $1 - 10^{-9}$. 

The conclusion is that if our aim is to minimize the worst estimation error 
over 100 statistical experiments or more and we have no information on the standard 
deviation except that it is in some range of the kind (1/100, 100) (which is 
pretty huge and could be increased even more if desired), then the performance of 
the empirical mean estimator for the worst sample distribution breaks down but 
thresholding very large outliers as $\hat\theta$ does can cure this problem. 

4. Mean and variance estimates depending on the kurtosis 



Situations where the variance is unknown are likely to happen. We have seen 
in the previous section how to adapt to an unknown variance. The price to pay 
is a loss of a factor two in the deviation bound, and the fact that it is no longer 
observable. 

Here we will make hypotheses under which it is possible to estimate both the 
variance and the mean, and to obtain an observable confidence interval, without 
losing a factor two as in the previous section. Making use of the kurtosis pa- 
rameter is the most natural way to achieve these goals in the framework of our 
approach. This is what we are going to do here. 

4.1. Some variance estimate depending on the kurtosis. In this section, 
we are going to consider an alternative to the unbiased usual variance estimate 

$$V = \frac{1}{n - 1} \sum_{i=1}^n \Bigl( Y_i - \frac{1}{n} \sum_{j=1}^n Y_j \Bigr)^2 = \frac{1}{n(n - 1)} \sum_{1 \le i < j \le n} (Y_i - Y_j)^2. \tag{4.1}$$

We will assume that the fourth moment $\mathbb{E}(Y^4)$ is finite and that some upper bound 
is known for the kurtosis 

$$\kappa = \frac{\mathbb{E}\bigl[(Y - m)^4\bigr]}{v^2}.$$

Our aim will be, as before with the mean, to define an estimate with better deviations 
than $V$. We will use $\kappa$ in the following computations, but when only an 
upper bound is known, $\kappa$ can be replaced with this upper bound in the definition 
of the estimator and the estimates of its performance. 
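
The equality between the centred form and the pairwise form of the usual unbiased variance estimate is what makes a block-wise treatment of the differences $(Y_i - Y_j)^2$ natural. It can be checked in exact arithmetic; the function names and the toy sample below are ours.

```python
from fractions import Fraction
from itertools import combinations

def variance_centered(ys):
    """1/(n-1) times the sum of squared deviations from the empirical mean."""
    n = len(ys)
    mean = sum(ys) / n
    return sum((y - mean) ** 2 for y in ys) / (n - 1)

def variance_pairwise(ys):
    """1/(n(n-1)) times the sum over pairs i < j of (Y_i - Y_j)^2."""
    n = len(ys)
    return sum((a - b) ** 2 for a, b in combinations(ys, 2)) / (n * (n - 1))

ys = [Fraction(k) for k in (1, 4, 4, 7, 10)]
print(variance_centered(ys), variance_pairwise(ys))
```

Using `Fraction` avoids any rounding, so the two values agree exactly and not only up to floating-point error.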

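Let us write $n = pq + r$, with $0 \le r < p$, and let $\{1, \dots, n\} = \bigsqcup_{\ell=1}^q I_\ell$ be the 
partition of the $n$ first integers defined by 

$$I_\ell = \begin{cases} \{ i \in \mathbb{N} ;\ p(\ell - 1) < i \le p\ell \}, & 1 \le \ell < q, \\ \{ i \in \mathbb{N} ;\ p(\ell - 1) < i \le n \}, & \ell = q. \end{cases}$$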

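We will develop some kind of block threshold scheme, introducing 

$$Q_\delta(\beta) = \frac{1}{q} \sum_{\ell=1}^q \psi\Bigl[ \frac{\beta}{|I_\ell| (|I_\ell| - 1)} \sum_{i < j \in I_\ell} (Y_i - Y_j)^2 - \delta \Bigr] = \frac{1}{q} \sum_{\ell=1}^q \psi\Bigl[ \frac{\beta}{|I_\ell| - 1} \sum_{i \in I_\ell} \Bigl( Y_i - \frac{1}{|I_\ell|} \sum_{j \in I_\ell} Y_j \Bigr)^2 - \delta \Bigr],$$

where $\psi$ is a non-decreasing influence function satisfying (2.1). 

If $\psi$ were replaced with the identity, we would have $\mathbb{E}[Q_\delta(\beta)] = \beta v - \delta$. 
The idea is to solve $Q_\delta(\hat\beta) = 0$ in $\hat\beta$ and to estimate $v$ by $\delta / \hat\beta$. Anyhow, for 
technical reasons, we will adopt a slightly different definition for $\hat\beta$ as well as for 
the estimate of $v$, as we will show now. Let us first get some deviation inequalities 
for $Q_\delta(\beta)$, derived as usual from exponential bounds. It is straightforward to see 
that 

$$\mathbb{E}\bigl\{\exp[q Q(\beta)]\bigr\} \le \prod_{\ell=1}^q \Bigl\{ 1 + (\beta v - \delta) + \frac{1}{2} (\beta v - \delta)^2 + \frac{\beta^2}{2 |I_\ell|^2 (|I_\ell| - 1)^2} \sum_{\substack{i < j \in I_\ell \\ s < t \in I_\ell}} \mathbb{E}\bigl\{ \bigl[(Y_i - Y_j)^2 - 2v\bigr] \bigl[(Y_s - Y_t)^2 - 2v\bigr] \bigr\} \Bigr\}.$$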



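We can now compute for any $i \ne j$ 

$$\mathbb{E}\bigl\{ \bigl[(Y_i - Y_j)^2 - 2v\bigr]^2 \bigr\} = \mathbb{E}\bigl[(Y_i - Y_j)^4\bigr] - 4v^2 = 2\kappa v^2 + 6v^2 - 4v^2 = 2(\kappa + 1) v^2,$$

and for any distinct values of $i$, $j$ and $s$, 

$$\mathbb{E}\bigl\{ \bigl[(Y_i - Y_s)^2 - 2v\bigr] \bigl[(Y_j - Y_s)^2 - 2v\bigr] \bigr\} = \mathbb{E}\bigl[(Y_i - Y_s)^2 (Y_j - Y_s)^2\bigr] - 4v^2$$

$$= \mathbb{E}\bigl\{ \bigl[(Y_i - m)^2 - 2(Y_i - m)(Y_s - m) + (Y_s - m)^2\bigr] \times \bigl[(Y_j - m)^2 - 2(Y_j - m)(Y_s - m) + (Y_s - m)^2\bigr] \bigr\} - 4v^2$$

$$= \mathbb{E}\bigl[(Y_s - m)^4\bigr] + \mathbb{E}\bigl\{ (Y_s - m)^2 \bigl[(Y_i - m)^2 + (Y_j - m)^2\bigr] \bigr\} + \mathbb{E}\bigl[(Y_i - m)^2 (Y_j - m)^2\bigr] - 4v^2 = (\kappa - 1) v^2.$$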
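
These two moment identities hold for any distribution with a finite fourth moment, so they can be verified exactly on a small discrete example by enumerating all outcomes. The three-point uniform distribution below is an arbitrary toy choice of ours, as are the helper names.

```python
from fractions import Fraction
from itertools import product

# Assumed toy distribution: uniform on three points (deliberately asymmetric).
support = [Fraction(-1), Fraction(0), Fraction(3)]

def expect(f, k: int) -> Fraction:
    """Exact expectation of f(Y_1, ..., Y_k) for k i.i.d. draws from the support."""
    points = list(product(support, repeat=k))
    return sum(f(*p) for p in points) / len(points)

m = expect(lambda y: y, 1)
v = expect(lambda y: (y - m) ** 2, 1)
kappa = expect(lambda y: (y - m) ** 4, 1) / v ** 2

# Same pair of indices: E{[(Y_i - Y_j)^2 - 2v]^2}.
same_pair = expect(lambda a, b: ((a - b) ** 2 - 2 * v) ** 2, 2)

# Two pairs sharing exactly one index s.
shared_index = expect(
    lambda a, b, s: ((a - s) ** 2 - 2 * v) * ((b - s) ** 2 - 2 * v), 3
)

print(same_pair, 2 * (kappa + 1) * v ** 2)
print(shared_index, (kappa - 1) * v ** 2)
```

Pairs with no index in common contribute nothing, by independence, which is why only these two cases enter the sum computed next.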

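Thus 

$$\sum_{\substack{i < j \in I_\ell \\ s < t \in I_\ell}} \mathbb{E}\bigl\{ \bigl[(Y_i - Y_j)^2 - 2v\bigr] \bigl[(Y_s - Y_t)^2 - 2v\bigr] \bigr\} = |I_\ell| (|I_\ell| - 1)(\kappa + 1) v^2 + |I_\ell| (|I_\ell| - 1)(|I_\ell| - 2)(\kappa - 1) v^2$$

$$= |I_\ell| (|I_\ell| - 1)^2 \Bigl[ (\kappa - 1) + \frac{2}{|I_\ell| - 1} \Bigr] v^2.$$

It shows that 

$$\mathbb{E}\bigl\{\exp[q Q(\beta)]\bigr\} \le \Bigl\{ 1 + (\beta v - \delta) + \frac{1}{2} (\beta v - \delta)^2 + \frac{\beta^2 v^2}{2p} \Bigl[ \kappa - 1 + \frac{2}{p - 1} \Bigr] \Bigr\}^q.$$

In the same way 

$$\mathbb{E}\bigl\{\exp[-q Q(\beta)]\bigr\} \le \Bigl\{ 1 - (\beta v - \delta) + \frac{1}{2} (\beta v - \delta)^2 + \frac{\beta^2 v^2}{2p} \Bigl[ \kappa - 1 + \frac{2}{p - 1} \Bigr] \Bigr\}^q.$$

Let $\chi = \kappa - 1 + \dfrac{2}{p - 1}$. For any given $\beta_1, \beta_2 \in \mathbb{R}_+$, with probability 
at least $1 - 2\epsilon_1$, 

$$Q(\beta_1) < \beta_1 v - \delta + \frac{1}{2} (\beta_1 v - \delta)^2 + \frac{\chi \beta_1^2 v^2}{2p} + \frac{\log(\epsilon_1^{-1})}{q},$$

$$Q(\beta_2) > \beta_2 v - \delta - \frac{1}{2} (\beta_2 v - \delta)^2 - \frac{\chi \beta_2^2 v^2}{2p} - \frac{\log(\epsilon_1^{-1})}{q}.$$

Let us define, for some parameter $y \in \mathbb{R}$, $\hat\beta$ as 

$$Q(\hat\beta) = -y, \quad \text{and let us choose} \quad y = \frac{\chi \delta^2}{2p} + \frac{\log(\epsilon_1^{-1})}{q}, \quad \text{and} \quad \beta_2 = \frac{\delta}{v},$$

so that $Q(\beta_2) > -y$. Let us put $\xi = \delta - \beta_1 v$ and let us choose $\xi$ such that 
$Q(\beta_1) < -y$. This implies that $\xi$ is solution of 

$$\frac{1 + \zeta}{2} \xi^2 - (1 + \zeta\delta) \xi + 2y \le 0 \quad \text{where} \quad \zeta = \frac{\chi}{p}.$$

Provided that $(1 + \zeta\delta)^2 \ge 4(1 + \zeta) y$, the smallest solution of this equation is 

$$\xi = \frac{4y}{1 + \zeta\delta + \sqrt{(1 + \zeta\delta)^2 - 4(1 + \zeta) y}}.$$

With these parameters, with probability at least $1 - 2\epsilon_1$, $Q\bigl[(\delta - \xi)/v\bigr] < -y < 
Q(\delta/v)$, implying that 

$$\frac{\delta - \xi}{v} \le \hat\beta \le \frac{\delta}{v}.$$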



Olivier Catoni 



August 15, 2011 






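Thus putting 

$$\hat v = \frac{\sqrt{\delta (\delta - \xi)}}{\hat\beta},$$

we get 

$$\Bigl(1 - \frac{\xi}{\delta}\Bigr)^{1/2} \le \frac{\hat v}{v} \le \Bigl(1 - \frac{\xi}{\delta}\Bigr)^{-1/2}.$$

In order to minimize $\dfrac{\xi}{\delta} \simeq \dfrac{2y}{\delta}$, we are led to take $\delta = \sqrt{\dfrac{2p \log(\epsilon_1^{-1})}{\chi q}}$. We get 

$$y = \frac{2 \log(\epsilon_1^{-1})}{q}, \qquad \frac{y}{\delta} = \sqrt{\frac{2\chi \log(\epsilon_1^{-1})}{n - r}},$$

$$\zeta\delta = \frac{y}{\delta} \quad \text{and} \quad (1 + \zeta) y = \frac{2 \log(\epsilon_1^{-1})}{q} + \frac{2\chi \log(\epsilon_1^{-1})}{n - r}.$$

Thus the condition becomes 

$$\frac{8 \log(\epsilon_1^{-1})}{q} + \frac{8\chi \log(\epsilon_1^{-1})}{n - r} \le \Bigl(1 + \sqrt{\frac{2\chi \log(\epsilon_1^{-1})}{n - r}}\Bigr)^2. \tag{4.2}$$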



PROPOSITION 4.1 Under condition (4.2), with probability at least $1 - 2\epsilon_1$, 

$$|\log(\hat v) - \log(v)| \le -\frac{1}{2} \log\Bigl(1 - \frac{\xi}{\delta}\Bigr).$$

A simpler result is obtained by choosing $\xi = 2y(1 + 2y)$ (the values of $y$ and $\delta$ 
being kept the same, so that we modify only the choice of $\beta_1$ through a different 
choice of $\xi$). In this case, $Q(\beta_1) < -y$ as soon as 

$$(1 + 2y)^2 \le \frac{2 + 2\zeta\delta + \zeta\delta/y}{1 + \zeta} = 2 + \frac{2\zeta\delta + \zeta\delta/y - 2\zeta}{1 + \zeta},$$

which is true as soon as $(1 + 2y)^2 \le 2$ and $\dfrac{\delta}{y} \ge 2$, yielding the simplified condition 

$$\log(\epsilon_1^{-1}) \le \min\Bigl\{ \frac{q}{4(1 + \sqrt{2})}, \frac{n - r}{8\chi} \Bigr\}. \tag{4.3}$$

In this case, we get with probability at least $1 - 2\epsilon_1$ that 

$$|\log(\hat v) - \log(v)| \le -\frac{1}{2} \log\Bigl(1 - \frac{2y(1 + 2y)}{\delta}\Bigr).$$

Proposition 4.2 Under condition (4.3), with probability at least $1 - 2\epsilon_1$, 

$$|\log(\hat v) - \log(v)| \le -\frac{1}{2} \log\Bigl[ 1 - 2 \sqrt{\frac{2\chi \log(\epsilon_1^{-1})}{n - r}} \Bigl( 1 + \frac{4 \log(\epsilon_1^{-1})}{q} \Bigr) \Bigr].$$



Recalling that $\chi = \kappa - 1 + \dfrac{2}{p - 1}$, we can choose in the previous proposition the 
approximately optimal block size 

$$p = \sqrt{\frac{n}{(\kappa - 1) \bigl[4 \log(\epsilon_1^{-1}) + 1/2\bigr]}}.$$



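Corollary 4.3 For this choice of parameter, as soon as the kurtosis (or its 
upper bound) $\kappa \ge 3$, under the condition 

$$\log(\epsilon_1^{-1}) \le \frac{n}{36 (\kappa - 1)} - \frac{1}{8}, \tag{4.4}$$

with probability at least $1 - 2\epsilon_1$, 

$$|\log(\hat v) - \log(v)| \le -\frac{1}{2} \log\Bigl[ 1 - 2 \sqrt{\frac{2 (\kappa - 1) \log(\epsilon_1^{-1})}{n}} \exp\Bigl( 4 \sqrt{\frac{4 \log(\epsilon_1^{-1}) + 1/2}{(\kappa - 1) n}} \Bigr) \Bigr] \underset{n \to \infty}{\sim} \sqrt{\frac{2 (\kappa - 1) \log(\epsilon_1^{-1})}{n}}.$$

This is the asymptotics we hoped for, since the variance of $(Y_i - m)^2$ is equal to 
$(\kappa - 1) v^2$. The proof is given on page 31. 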



Let us plot some curves, showing the tighter bounds of Proposition 4.1, 
with optimal choice of $p$. We compare our deviation bounds with the exact 
deviation quantiles of the variance estimate of equation (4.1) applied to a 
Gaussian sample (given by a $\chi^2$ distribution). This demonstrates that we can stay 
of the same order under much weaker assumptions. 



[Figure: $n = 1000$, $v = 1$, $\kappa = 3$, $p = 4$. Deviation bounds plotted against $\epsilon$ on a 
logarithmic scale from $10^{-9}$, starting from 0.5, confidence level $= 1 - 2\epsilon$.] 

[Figure: $n = 5000$, $v = 1$, $\kappa = 6$, $p = 5$. Deviation bounds plotted against $\epsilon$ on a 
logarithmic scale from $10^{-9}$, starting from 0.5, confidence level $= 1 - 2\epsilon$. Curves: the non 
biased estimate $V$ in the Gaussian case; our estimate $\hat v$.] 



4.2. Mean estimate under a kurtosis assumption. Here, we are going to plug a variance estimate $\hat v$ into a mean estimate. Let us therefore assume that $\hat v$ is a variance estimate such that with probability at least $1 - 2\epsilon_1$,

$$|\log(\hat v) - \log(v)| \le \zeta.$$

This estimate may for example be the one defined in the previous section. Let $\hat\alpha$ be some estimate of the desired value of the parameter $\alpha$, to be defined later as a function of $\hat v$. Let us define $\hat\theta = \hat\theta_{\hat\alpha}$ by

$$r(\hat\theta) = \frac{1}{n\hat\alpha} \sum_{i=1}^n \psi\bigl[\hat\alpha (Y_i - \hat\theta)\bigr] = 0, \tag{4.5}$$

where $\psi$ is the narrow influence function defined by equation (2.3), page 5. As usual, we are looking for non-random values $\theta_-$ and $\theta_+$ such that with large probability $r(\theta_+) < 0 < r(\theta_-)$, implying that $\theta_- < \hat\theta < \theta_+$. But there is a new difficulty, caused by the fact that $\hat\alpha$ will be an estimate depending on the value of the sample. This problem will be solved with the help of PAC-Bayes inequalities.
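To make the definition of the M-estimator concrete, here is a minimal Python sketch that solves $r(\theta) = 0$ by bisection. It rests on assumptions that are not part of this section: the narrow influence function of equation (2.3) is taken to be $\psi(x) = -\log(1 - x + x^2/2)$ on $[0, 1]$, extended by oddness and kept constant at $\pm\log 2$ outside $[-1, 1]$, and the scale `alpha` is supplied by the caller. The function names are illustrative only.

```python
import math


def psi_narrow(x):
    """Narrow influence function (assumed form of equation (2.3)):
    odd, non-decreasing, and constant equal to +/- log(2) outside [-1, 1]."""
    if x >= 1.0:
        return math.log(2.0)
    if x >= 0.0:
        return -math.log(1.0 - x + 0.5 * x * x)
    if x >= -1.0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(2.0)


def r(theta, sample, alpha):
    """Left-hand side of (4.5): r(theta) = (1 / (n alpha)) sum_i psi(alpha (Y_i - theta))."""
    n = len(sample)
    return sum(psi_narrow(alpha * (y - theta)) for y in sample) / (n * alpha)


def m_estimate(sample, alpha, tol=1e-10):
    """Solve r(theta) = 0 by bisection.

    r is non-increasing in theta, r(min(sample)) >= 0 and r(max(sample)) <= 0,
    so a root lies between the sample extremes.
    """
    lo, hi = min(sample), max(sample)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if r(mid, sample, alpha) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, on nine observations equal to 0 and one equal to 100, with `alpha = 1.0`, the empirical mean is 10 whereas `m_estimate` stays below 0.1, because the contribution of the outlier is capped at $\log 2$.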

To take advantage of PAC-Bayes theorems, we are going to compare $\hat\alpha$ with a perturbation $\tilde\alpha$ built with the help of some supplementary random variable. Let indeed $U$ be some uniform real random variable on the interval $(-1, +1)$, independent from everything else. Let us consider

$$\tilde\alpha = \hat\alpha + x\,\alpha \sinh(\zeta/2)\, U.$$
Let $\rho$ be the distribution of $\tilde\alpha$ given the sample value. We are going to compare this posterior distribution (meaning that it depends on the sample), with a prior distribution that is independent from the sample. Let $\pi$ be the uniform probability distribution on the interval $\bigl(\alpha[\exp(-\zeta/2) - x\sinh(\zeta/2)],\ \alpha[\exp(\zeta/2) + x\sinh(\zeta/2)]\bigr)$. Let us assume that $\hat\alpha$ and $\alpha$ are defined with the help of some positive constant $c$ as $\hat\alpha = \sqrt{c/\hat v}$ and $\alpha = \sqrt{c/v}$. In this case, with probability at least $1 - 2\epsilon_1$,

$$|\log(\hat\alpha) - \log(\alpha)| \le \frac{\zeta}{2}. \tag{4.6}$$

As a result, with probability at least $1 - 2\epsilon_1$,

$$\mathcal{K}(\rho, \pi) = \log(1 + x^{-1}).$$

Indeed, whenever equation (4.6) holds, the (random) support of $\rho$ is included in the (fixed) support of $\pi$, so that the relative entropy of $\rho$ with respect to $\pi$ is given by the logarithm of the ratio of the lengths of their supports.
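The support-length argument can be checked numerically. The sketch below is only an illustration: it assumes the posterior is uniform on $\hat\alpha \pm x\alpha\sinh(\zeta/2)$ and the prior uniform on the interval given above, and the helper names are hypothetical.

```python
import math


def kl_uniform(inner, outer):
    """Relative entropy K(rho, pi) between two uniform distributions on intervals.

    Equal to log(length(outer) / length(inner)) when inner is included in outer,
    and infinite otherwise.
    """
    (a, b), (c, d) = inner, outer
    if not (c <= a and b <= d):
        return math.inf
    return math.log((d - c) / (b - a))


def supports(alpha, alpha_hat, zeta, x):
    """Supports of the posterior rho (centred at alpha_hat) and of the prior pi."""
    s = math.sinh(zeta / 2.0)
    posterior = (alpha_hat - x * alpha * s, alpha_hat + x * alpha * s)
    prior = (alpha * (math.exp(-zeta / 2.0) - x * s),
             alpha * (math.exp(zeta / 2.0) + x * s))
    return posterior, prior
```

As long as $|\log(\hat\alpha) - \log(\alpha)| \le \zeta/2$, the value returned by `kl_uniform(*supports(...))` does not depend on $\hat\alpha$ and equals $\log(1 + x^{-1})$; when the condition fails, the inclusion may break and the relative entropy becomes infinite.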

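Let us now upper-bound $\psi[\hat\alpha(Y_i - \theta)]$ as a suitable function of $\rho$.

LEMMA 4.4 For any posterior distribution $\rho$ and any $f \in L_2(\rho)$,

$$\psi\Bigl[\int \rho(d\beta) f(\beta)\Bigr] \le \int \rho(d\beta) \log\Bigl[1 + f(\beta) + \frac{1}{2} f(\beta)^2 + \frac{a}{2} \operatorname{Var}_\rho(f)\Bigr],$$

where $a \le 4.43$ is the numerical constant of equation (7.1), page 34.

Applying this inequality to $f(\beta) = \beta(Y_i - \theta)$ in the first place and $f(\beta) = \beta(\theta - Y_i)$ to get the reversed inequality, we obtain

$$\psi[\hat\alpha(Y_i - \theta)] \le \int \rho(d\beta) \log\Bigl\{1 + \beta(Y_i - \theta) + \frac{1}{2}\Bigl[\beta^2 + \frac{a}{3} x^2 \alpha^2 \sinh(\zeta/2)^2\Bigr](Y_i - \theta)^2\Bigr\}, \tag{4.7}$$

$$\psi[\hat\alpha(\theta - Y_i)] \le \int \rho(d\beta) \log\Bigl\{1 + \beta(\theta - Y_i) + \frac{1}{2}\Bigl[\beta^2 + \frac{a}{3} x^2 \alpha^2 \sinh(\zeta/2)^2\Bigr](Y_i - \theta)^2\Bigr\}. \tag{4.8}$$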

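Let us now recall the fundamental PAC-Bayes inequality, concerned with any function $(\beta, y) \mapsto f(\beta, y) \in L_1(\pi \otimes P)$, where $P$ is the distribution of $Y$, such that $\inf f > -1$:

$$E\Bigl\{\exp\Bigl[\sup_\rho \int \rho(d\beta)\Bigl\{\sum_{i=1}^n \log[1 + f(\beta, Y_i)] - n \log[1 + E(f(\beta, Y))]\Bigr\} - \mathcal{K}(\rho, \pi)\Bigr]\Bigr\}$$
$$\le E\Bigl\{\int \pi(d\beta) \exp\Bigl(\sum_{i=1}^n \log[1 + f(\beta, Y_i)] - n \log[1 + E(f(\beta, Y))]\Bigr)\Bigr\} = 1,$$

according to [7, page 159] and Fubini's lemma for the last equality. Thus, from Chebyshev's inequality, with probability at least $1 - \epsilon_2$,

$$\int \rho(d\beta) \sum_{i=1}^n \log[1 + f(\beta, Y_i)] \le n \int \rho(d\beta) \log[1 + E(f(\beta, Y))] + \mathcal{K}(\rho, \pi) + \log(\epsilon_2^{-1})$$
$$\le n \int \rho(d\beta) E(f(\beta, Y)) + \mathcal{K}(\rho, \pi) + \log(\epsilon_2^{-1}).$$

Applying this inequality to the case we are interested in, that is to equations (4.7) and (4.8), we obtain with probability at least $1 - 2\epsilon_1 - 2\epsilon_2$ that

$$\hat\alpha\, r(\theta_+) \le \hat\alpha(m - \theta_+) + \frac{1}{2}\Bigl[\hat\alpha^2 + \frac{a+1}{3} x^2 \alpha^2 \sinh(\zeta/2)^2\Bigr]\bigl[v + (m - \theta_+)^2\bigr] + \frac{\log(1 + x^{-1}) + \log(\epsilon_2^{-1})}{n}$$

and

$$\hat\alpha\, r(\theta_-) \ge \hat\alpha(m - \theta_-) - \frac{1}{2}\Bigl[\hat\alpha^2 + \frac{a+1}{3} x^2 \alpha^2 \sinh(\zeta/2)^2\Bigr]\bigl[v + (m - \theta_-)^2\bigr] - \frac{\log(1 + x^{-1}) + \log(\epsilon_2^{-1})}{n}.$$

Let us put $\theta_+ - m = m - \theta_- = \sqrt{\gamma v}$ and let us look for some value of $\gamma$ ensuring that $r(\theta_+) < 0 < r(\theta_-)$, implying that $\theta_- < \hat\theta < \theta_+$. Let us choose

$$\alpha = \sqrt{\frac{2\bigl[\log(1 + x^{-1}) + \log(\epsilon_2^{-1})\bigr]}{n\bigl[1 - \frac{a+1}{3} x^2 \sinh(\zeta/2)^2\bigr](1 + \gamma) v}}, \qquad \hat\alpha = \sqrt{\frac{2\bigl[\log(1 + x^{-1}) + \log(\epsilon_2^{-1})\bigr]}{n\bigl[1 - \frac{a+1}{3} x^2 \sinh(\zeta/2)^2\bigr](1 + \gamma) \hat v}}, \tag{4.9}$$

assuming that $x$ will be chosen later on such that

$$\frac{a+1}{3} x^2 \sinh(\zeta/2)^2 < 1.$$

Since

$$\frac{\log(1 + x^{-1}) + \log(\epsilon_2^{-1})}{n} = \frac{\alpha^2}{2}\Bigl(1 - \frac{a+1}{3} x^2 \sinh(\zeta/2)^2\Bigr)(1 + \gamma) v,$$

we obtain with probability at least $1 - 2\epsilon_1 - 2\epsilon_2$

$$r(\theta_+) \le -\sqrt{\gamma v} + \frac{\alpha v (1 + \gamma)}{2}\Bigl(\frac{\hat\alpha}{\alpha} + \frac{\alpha}{\hat\alpha}\Bigr) \le -\sqrt{\gamma v} + \alpha v (1 + \gamma) \cosh(\zeta/2),$$

$$\text{and } r(\theta_-) \ge \sqrt{\gamma v} - \frac{\alpha v (1 + \gamma)}{2}\Bigl(\frac{\hat\alpha}{\alpha} + \frac{\alpha}{\hat\alpha}\Bigr) \ge \sqrt{\gamma v} - \alpha v (1 + \gamma) \cosh(\zeta/2).$$

Therefore, if we choose $\gamma$ such that $\sqrt{\gamma v} = \alpha v (1 + \gamma) \cosh(\zeta/2)$, we obtain with probability at least $1 - 2\epsilon_1 - 2\epsilon_2$ that $r(\theta_+) \le 0 \le r(\theta_-)$, and therefore that $\theta_- \le \hat\theta \le \theta_+$. The corresponding value of $\gamma$ is $\gamma = \eta/(1 - \eta)$, where

$$\eta = \frac{2 \cosh(\zeta/2)^2 \bigl[\log(1 + x^{-1}) + \log(\epsilon_2^{-1})\bigr]}{n\bigl[1 - \frac{a+1}{3} x^2 \sinh(\zeta/2)^2\bigr]}.$$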



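Proposition 4.5 With probability at least $1 - 2\epsilon_1 - 2\epsilon_2$, the estimator $\hat\theta_{\hat\alpha}$ defined by equation (4.5), page 20, where $\hat\alpha$ is set as in (4.9), page 22, satisfies

$$|\hat\theta_{\hat\alpha} - m| \le \sqrt{\frac{\eta v}{1 - \eta}} \le \sqrt{\frac{\eta \hat v}{1 - \eta}} \exp(\zeta/2) \le \sqrt{\frac{\eta v}{1 - \eta}} \exp(\zeta).$$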



The optimal value of $x$ is the one minimizing

$$\frac{\log(1 + x^{-1}) + \log(\epsilon_2^{-1})}{1 - \frac{a+1}{3} x^2 \sinh(\zeta/2)^2}.$$

Assuming $\zeta$ to be small, the optimal $x$ will be large, so that $\log(1 + x^{-1}) \simeq x^{-1}$ and we can choose the approximately optimal value

$$x = \Bigl[\frac{2(a+1)}{3} \log(\epsilon_2^{-1})\Bigr]^{-1/3} \sinh(\zeta/2)^{-2/3}.$$
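The quantities entering Proposition 4.5 are easy to evaluate numerically. The following Python sketch is a hedged illustration: it assumes the expressions of $\eta$ and of the approximately optimal $x$ as reconstructed from the derivation above, uses the upper bound $a = 4.43$, and the function names are hypothetical.

```python
import math

A_CONST = 4.43  # upper bound on the numerical constant a of equation (7.1)


def approx_optimal_x(zeta, eps2, a=A_CONST):
    """Approximately optimal x = [2(a+1)/3 * log(1/eps2)]^(-1/3) * sinh(zeta/2)^(-2/3)."""
    return ((2.0 * (a + 1.0) / 3.0 * math.log(1.0 / eps2)) ** (-1.0 / 3.0)
            * math.sinh(zeta / 2.0) ** (-2.0 / 3.0))


def eta(n, zeta, x, eps2, a=A_CONST):
    """eta = 2 cosh(zeta/2)^2 [log(1 + 1/x) + log(1/eps2)] / (n [1 - (a+1)/3 x^2 sinh(zeta/2)^2])."""
    shrink = 1.0 - (a + 1.0) / 3.0 * x ** 2 * math.sinh(zeta / 2.0) ** 2
    if shrink <= 0.0:
        raise ValueError("x is too large: (a+1)/3 * x^2 * sinh(zeta/2)^2 must be < 1")
    entropy = math.log(1.0 + 1.0 / x) + math.log(1.0 / eps2)
    return 2.0 * math.cosh(zeta / 2.0) ** 2 * entropy / (n * shrink)


def deviation_bound(n, v, zeta, eps2, a=A_CONST):
    """First bound of Proposition 4.5, sqrt(eta v / (1 - eta)), with the approximate x."""
    x = approx_optimal_x(zeta, eps2, a)
    h = eta(n, zeta, x, eps2, a)
    if h >= 1.0:
        raise ValueError("eta >= 1: the bound is void for these parameters")
    return math.sqrt(h * v / (1.0 - h))
```

With $n = 1000$, $v = 1$, $\zeta = 0.1$ and $\epsilon_2 = 0.01$, the bound is roughly 6% above the known-variance value $\sqrt{2 \log(\epsilon_2^{-1}) v / n}$.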



Let us discuss now the question of balancing $\epsilon_1$ and $\epsilon_2$. Let us put $\epsilon = \epsilon_1 + \epsilon_2$ and let $y = \epsilon_1/\epsilon$. Optimizing $y$ for a fixed value of $\epsilon$ could be done numerically, although it seems difficult to obtain a closed formula. However, the entropy term in $\eta$ can be written as $\log(1 + x^{-1}) + \log(\epsilon^{-1}) - \log(1 - y)$. Since $\zeta$ decreases, and therefore the almost optimal $x$ above increases, when $y$ increases, we will get an optimal order of magnitude (up to some constant less than 2) for the bound if we balance $-\log(1 - y)$ and $\log(1 + x^{-1})$, resulting in the choice $y = (1 + x)^{-1}$, where $x$ is approximately optimized as stated above (this choice of $x$ depends on $y$, so we end up with an equation for $x$, which can be solved using an iterative approach). This results, with probability at least $1 - 2\epsilon$, in an upper bound for $|\hat\theta - m|$ equivalent for large values of the sample size $n$ to

$$\sqrt{\frac{2 \log(\epsilon^{-1}) v}{n}}.$$

Thus we recover, as desired, the same asymptotics as when the variance is known.

Let us illustrate what we get when $n = 500$ or $n = 1000$ and it is known that $\kappa \le 3$.

On these figures, we have plotted upper and lower bounds for the deviations of the empirical mean when the sample distribution is the least favourable one in $\mathcal{B}_{v,\kappa}$. (These bounds will be proved later on. The upper bound is computed by taking the minimum of three bounds, explaining the discontinuities of its derivative.) What we see on the $n = 500$ example is that our bound remains of the same order as the Gaussian bound up to confidence levels of order $1 - 10^{-8}$, whereas this is not the case with the empirical mean.

In the case when $n = 1000$, we see that our estimator possibly improves on the empirical mean in the range of confidence levels going from $1 - 10^{-2}$ to $1 - 10^{-6}$ and is a proved winner in the range going from $1 - 10^{-6}$ to $1 - 10^{-14}$.

[Figure: deviation bounds, $n = 500$, $v = 1$, $\kappa = 3$, $p = 3$. Horizontal axis: $\epsilon$, starting from 0.5 (confidence level $= 1 - 2\epsilon$). Curves: $|M - m|$ lower bound, worst sample in $\mathcal{B}_{1,\kappa}$; $|M - m|$ upper bound, worst sample in $\mathcal{B}_{1,\kappa}$; $|\hat\theta - m|$ conf. int. upper bound, worst sample in $\mathcal{B}_{1,\kappa}$; $|\hat\theta - m|$ conf. int. typical value, worst sample in $\mathcal{B}_{1,\kappa}$; $|\hat\theta - m|$ upper bound, worst sample in $\mathcal{B}_{1,\kappa}$; $|\hat\theta_\alpha - m|$ upper bound, worst sample in $\mathcal{A}_1$; $|M - m|$ upper bound, Gaussian sample.]

$\mathcal{B}_{v,\kappa}$ is defined by (1.2), page 3.

[Figure: deviation bounds, $n = 1000$, $v = 1$, $\kappa = 3$, $p = 3$. Same axis and same curves as in the previous figure.]



Let us see now the influence of k and plot the curves corresponding to increas- 
ing values of n and k. 



[Figure: deviation bounds, $n = 1000$, $v = 1$, $\kappa = 6$, $p = 3$. Horizontal axis: $\epsilon$, starting from 0.5 (confidence level $= 1 - 2\epsilon$). Same curves as in the previous figures.]

[Figure: deviation bounds, $n = 10000$, $\kappa = 60$ (the other panel parameters are illegible in the source). Horizontal axis: $\epsilon$, starting from 0.5 (confidence level $= 1 - 2\epsilon$). Same curves as in the previous figures.]
When we double the kurtosis letting $n = 1000$, we follow the Gaussian curve up to confidence levels of $1 - 10^{-10}$ instead of $1 - 10^{-14}$. This is somehow the maximum kurtosis for which the bounds are satisfactory for this sample size. Looking at Proposition 4.2 (page 18), we see that the bound in first approximation depends on the ratio $\chi/n = \kappa/n$ (when $p = 3$), suggesting that to obtain similar performances, we have to take $n$ proportional to $\kappa$, which gives a minimum sample size of $n = 1000\kappa/6$, if we want to follow the Gaussian curve up to confidence levels of order at least $1 - 10^{-10}$.

These curves suggest another approach to choosing the kurtosis parameter $\kappa$: use the largest value of the kurtosis with a low impact on the bound of Proposition 4.5 (page 23), given the sample size. This leads, when in doubt about the true kurtosis value, for sample sizes $n \ge 1000$, to set, according to the previous rule of thumb, the kurtosis in the definition of the estimator to the value $\kappa_{\max} = 6n/1000$. Doing so, we get almost the same deviations as if the sample distribution were Gaussian, at levels of confidence up to $1 - 10^{-10}$, for the largest range of (possibly non-Gaussian) sample distributions.



5. Upper bounds for the deviations of the empirical mean 

In the previous sections, we compared new mean estimators with the empirical mean. We will devote the end of this paper to proving the bounds on the empirical mean used in these comparisons.

This section deals with upper bounds, whereas the next one will study corresponding lower bounds.

Let us start with the case when the sample distribution may be any probability measure with a finite variance. It is natural in this situation to bound the deviations of the empirical mean

$$M = \frac{1}{n} \sum_{i=1}^n Y_i$$

by applying Chebyshev's inequality to its second moment, to conclude that

$$P\Bigl(|M - m| \ge \sqrt{\frac{v}{2\epsilon n}}\Bigr) \le 2\epsilon. \tag{5.1}$$

This is in general close to optimal, as will be shown later when we compute corresponding lower bounds.
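To get a feeling for the gap that the rest of the paper tries to close, the short Python sketch below compares the Chebyshev deviation level of equation (5.1) with the Gaussian benchmark $\sqrt{v/n}\, F^{-1}(1 - \epsilon)$ discussed in Section 6.1. It uses only the standard library; the function names are illustrative.

```python
import math
from statistics import NormalDist


def chebyshev_deviation(v, n, eps):
    """Deviation level of equation (5.1): |M - m| < sqrt(v / (2 n eps)) w.p. >= 1 - 2 eps."""
    return math.sqrt(v / (2.0 * n * eps))


def gaussian_deviation(v, n, eps):
    """Exact one-sided deviation quantile of the empirical mean of a Gaussian sample."""
    return math.sqrt(v / n) * NormalDist().inv_cdf(1.0 - eps)


if __name__ == "__main__":
    for eps in (0.05, 1e-3, 1e-6):
        print(eps, chebyshev_deviation(1.0, 1000, eps), gaussian_deviation(1.0, 1000, eps))
```

The Chebyshev level grows like $\epsilon^{-1/2}$ whereas the Gaussian one grows like $\sqrt{2\log(\epsilon^{-1})}$, so the ratio between the two exceeds 100 at $\epsilon = 10^{-6}$ for $n = 1000$.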

When the sample distribution has a finite kurtosis, it is possible to take this into account to refine the bound. The analysis becomes less straightforward, and will be carried out in this section. The following bound uses a truncation argument, which allows the behaviour of small and large values to be studied separately. It is to our knowledge a new result. We will show later in this paper that its leading term is essentially tight, up to a factor $\bigl(\frac{\kappa}{\kappa - 1}\bigr)^{1/4}$, when the proper asymptotic is considered.

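Proposition 5.1 For any probability distribution whose kurtosis is not greater than $\kappa$, the empirical mean $M$ is such that with probability at least $1 - 2\epsilon$,

$$|M - m| \le \inf_{\lambda \in (0,1)} \Biggl\{ \sqrt{\frac{2 v \log(\lambda^{-1}\epsilon^{-1})}{n}} + \frac{\sqrt{\kappa v}\, \log(\lambda^{-1}\epsilon^{-1})}{3n} + \Bigl(\frac{\kappa}{2(1 - \lambda) n^3 \epsilon}\Bigr)^{1/4} \sqrt{v}\, \Bigl[1 + \frac{3(n-1)\kappa \log(\lambda^{-1}\epsilon^{-1})^2}{4^3 (1 + \sqrt{2})^4 n^2}\Bigr]^{1/4} \Biggr\}.$$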

Instead of minimizing the bound in A, one can also take for simplicity 

■ 1 

A = mm 




We see that there are two regimes in the behaviour of the deviations of $M$: a Gaussian regime for levels of confidence less than $1 - 1/n$ and a long tail regime for higher confidence levels, depending on the value of the kurtosis $\kappa$.

In addition to this, let us also put forward the fact that, even in the simple case when the mean $m$ is known, estimating the variance under a kurtosis hypothesis at high confidence levels cannot be done using the empirical estimator

$$M_2 = \frac{1}{n} \sum_{i=1}^n (Y_i - m)^2.$$

Indeed, assuming without loss of generality that $m = 0$ and computing the quadratic mean

$$E\bigl\{[M_2 - E(Y^2)]^2\bigr\} = \frac{E\bigl\{[Y^2 - E(Y^2)]^2\bigr\}}{n} = \frac{\kappa - 1}{n} E(Y^2)^2,$$

we can only conclude, using Chebyshev's inequality, that with probability at least $1 - 2\epsilon$,

$$E(Y^2) \le \frac{M_2}{1 - \sqrt{\frac{\kappa - 1}{2 n \epsilon}}},$$

a bound which blows up at level of confidence $\epsilon = \frac{\kappa - 1}{2n}$, and which we do not suspect to be substantially improvable in the worst case. In contrast to this, Propositions 4.1 and 4.2 (page 18) provide variance estimators with high confidence levels.
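The blow-up of the Chebyshev bound for the empirical second moment can be made explicit with a few lines of Python. This is only a sketch of the inequality stated above, with hypothetical function names; the bound is reported as infinite once $\sqrt{(\kappa - 1)/(2 n \epsilon)} \ge 1$.

```python
import math


def variance_upper_bound(m2, kappa, n, eps):
    """Chebyshev upper bound on E(Y^2) given the empirical second moment m2.

    Valid with probability at least 1 - 2 eps; infinite when the bound is void.
    """
    ratio = math.sqrt((kappa - 1.0) / (2.0 * n * eps))
    if ratio >= 1.0:
        return math.inf
    return m2 / (1.0 - ratio)


def blow_up_level(kappa, n):
    """Value of eps at which the bound above blows up: (kappa - 1) / (2 n)."""
    return (kappa - 1.0) / (2.0 * n)
```

For $\kappa = 3$ and $n = 1000$ the bound is void for every $\epsilon \le 10^{-3}$, that is for every confidence level above $1 - 2 \cdot 10^{-3}$.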
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6. Lower bounds 



6.1. Lower bound for Gaussian sample distributions. This lower bound is well known. We recall it here for the sake of completeness.

The empirical mean has optimal deviations when the sample distribution is Gaussian in the following precise sense.

PROPOSITION 6.1 For any estimator of the mean $\hat\theta : \mathbb{R}^n \to \mathbb{R}$, any variance value $v > 0$, and any deviation level $\eta > 0$, there is some Gaussian measure $\mathcal{N}(m, v)$ (with variance $v$ and mean $m$) such that the i.i.d. sample of length $n$ drawn from this distribution is such that

$$P(\hat\theta \ge m + \eta) \ge P(M \ge m + \eta) \quad \text{or} \quad P(\hat\theta \le m - \eta) \ge P(M \le m - \eta),$$

where $M = \frac{1}{n} \sum_{i=1}^n Y_i$ is the empirical mean.

This means that any distribution free symmetric confidence interval based on the (supposedly known) value of the variance has to include the confidence interval for the empirical mean of a Gaussian distribution, whose length is exactly known and equal to the properly scaled quantile of the Gaussian measure.

Let us state this more precisely. With the notations of the previous proposition,

$$P(M \ge m + \eta) = P(M \le m - \eta) = G\Bigl(\Bigl[\sqrt{\tfrac{n}{v}}\, \eta, +\infty\Bigr)\Bigr) = 1 - F\Bigl(\sqrt{\tfrac{n}{v}}\, \eta\Bigr),$$

where $G$ is the standard normal measure and $F$ its distribution function. The upper bounds proved in this paper can be decomposed into

$$P(\hat\theta \ge m + \eta) \le \epsilon \quad \text{and} \quad P(\hat\theta \le m - \eta) \le \epsilon,$$

although we preferred for simplicity to state them in the slightly weaker form $P(|\hat\theta - m| \ge \eta) \le 2\epsilon$.

As the Gaussian shift model, made of Gaussian sample distributions with a given variance and varying means, is included in all the models we are considering in this paper, we necessarily should have according to the previous proposition

$$\epsilon \ge 1 - F\Bigl(\sqrt{\tfrac{n}{v}}\, \eta\Bigr),$$

which can be also written as

$$\eta \ge \sqrt{\frac{v}{n}}\, F^{-1}(1 - \epsilon).$$

Therefore some visualisation of the quality of our bounds can be obtained by plotting $\epsilon \mapsto \eta$ against $\epsilon \mapsto \sqrt{\frac{v}{n}}\, F^{-1}(1 - \epsilon)$, as we did in the previous sections.

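Let us remark finally that the assumed symmetry of the confidence region is not a real limitation. Indeed, if we can prove for any given estimator $\hat\theta$ that for any Gaussian sample distribution with a given variance $v$,

$$P\bigl[m \ge \hat\theta + \eta_+(\epsilon)\bigr] \le \epsilon \quad \text{and} \quad P\bigl[m \le \hat\theta - \eta_-(\epsilon)\bigr] \le \epsilon,$$

then we may consider for any value of $\epsilon$ the estimator with symmetric confidence levels defined as

$$\hat\theta_s = \hat\theta + \frac{\eta_+(\epsilon) - \eta_-(\epsilon)}{2}.$$

This symmetric estimator is such that for any Gaussian sample distribution with variance $v$,

$$P\Bigl[m \ge \hat\theta_s + \frac{\eta_+(\epsilon) + \eta_-(\epsilon)}{2}\Bigr] \le \epsilon \quad \text{and} \quad P\Bigl[m \le \hat\theta_s - \frac{\eta_-(\epsilon) + \eta_+(\epsilon)}{2}\Bigr] \le \epsilon.$$

Thus, applying the previous proposition, we obtain that

$$\frac{\eta_+(\epsilon) + \eta_-(\epsilon)}{2} \ge \sqrt{\frac{v}{n}}\, F^{-1}(1 - \epsilon).$$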



6.2. Lower bound for the deviations of the empirical mean depending on the variance. In the following proposition, we state a lower bound for the deviations of the empirical mean when the sample distribution is the least favourable in $\mathcal{A}_{v_{\max}}$ (defined on page 3), meaning the distribution for which the deviations of the empirical mean are the largest.

Proposition 6.2 For any value of the variance $v$, any deviation level $\eta > 0$, there is some distribution with variance $v$ and mean $0$ such that the i.i.d. sample of size $n$ drawn from it satisfies

$$P(M \ge \eta) = P(M \le -\eta) \ge \frac{v}{2 n \eta^2}\Bigl(1 - \frac{v}{n^2 \eta^2}\Bigr)^{n-1}.$$

Thus, as soon as $\epsilon \le (2e)^{-1}$, with probability at least $2\epsilon$,

$$|M - m| \ge \sqrt{\frac{v}{2 n \epsilon}}\, \Bigl(1 - \frac{2 e \epsilon}{n}\Bigr)^{\frac{n-1}{2}}.$$

Let us remark that this bound is pretty tight, since, according to equation (5.1), page 27, with probability at least $1 - 2\epsilon$,

$$|M - m| \le \sqrt{\frac{v}{2 n \epsilon}}.$$

This can also be observed on the plots following Proposition 2.4 (page 8).

6.3. Lower bound for the deviations of empirical mean depending on the variance and the kurtosis. Let us now refine the previous lower bound by taking into account the kurtosis $\kappa$ of the sample distribution, assuming of course that it is finite.

Proposition 6.3 As soon as $\epsilon^{-1} \ge n \ge 16$, there exists a sample distribution with mean $m$, finite variance $v$ and finite kurtosis $\kappa$, such that with probability at least $2\epsilon$,



\M — ml > max< 



(«-l)(l-8e) 
4ne 



1/4 



"(K-l) 












2ne 







1/4 



'log[l6/( 



ne: 



2n 



Let us remark that the asymptotic behaviour of this lower bound when $n\epsilon$ and $\log(\epsilon^{-1})/n$ both tend to zero matches the upper bound of Proposition 5.1 (page 28) up to a multiplicative factor $\bigl(\frac{\kappa}{\kappa - 1}\bigr)^{1/4} \le 1.11$ when the kurtosis is $\kappa \ge 3$, where 3 is the kurtosis of the Gaussian distribution.

The plots following Proposition 4.5 (page 23) show that this lower bound is not too far from the upper bound obtained by combining Proposition 5.1 (page 28), Proposition 7.3 (page 37) and equation (5.1), page 27.


7. Proofs 



7.1. Proof of Corollary 4.3 (page 18). Let us remark first that condition (4.4), page 18, implies condition (4.3), page 17, as can be easily checked. Putting



/ 71 

x = J -f ; — 7- -^r, so that p = I x I , we can also check that p— 1 > 

V(*-l)[41og(er l ) + l/2]' 

x/2 and n — p + 1 > n/2. We can then write 



^xiogfcr 1 ) ^ | 4iog(er 1 ; 



n — r) \ q 



<t /2( K -i)iog( £r -) ex y 1 



n \ (k — l)(p — 1) 

p - 1 41 og(er 1 )p 

2(n — p + 1) n — p + 1 



+ 



< 2(« - 1) Mer 1 ) / 2 | 2x[41og(er 1 ) + 1/2] \ 
~ n \ (k — l)x n J 



2(«-l)log(er 1 ) T /41og(e L VM/2 
ft I V ( K — l)n 



7.2. Proof of Lemma 4.4 (page 21). Let us introduce some modification of $\psi$ in order to improve the compromise between $\inf \psi''$ and $\sup \psi$. Let us put $\bar\psi(x) = \log(1 + x + x^2/2)$. We would like to squeeze some function $\chi$ between $\psi$ and $\bar\psi$, in such a way that $\inf \chi'' = \inf \bar\psi''$. This will be better than using $\psi$ itself since

$$\inf \bar\psi'' = -1/4 \quad \text{whereas} \quad \inf \psi'' = -2.$$

Indeed these two values can be computed in the following way. Let us put $\varphi(x) = \exp[\bar\psi(x)] = 1 + x + x^2/2$. It is easy to check that

$$\bar\psi''(x) = -\varphi(x)^{-1}\bigl[1 - \varphi(x)^{-1}\bigr],$$
$$\psi''(x) = \varphi(-x)^{-1}\bigl[1 - \varphi(-x)^{-1}\bigr],$$

implying that $\bar\psi''(x) \ge -1/4$. This inequality becomes an equality when $\varphi(x) = 2$, that is when $x = \sqrt{3} - 1 \simeq 0.73$. In the same way $\psi''(x) \ge -2$ and equality is reached when $\varphi(-x) = 1/2$, that is when $x = 1$. We are going to build a function $\chi$ which follows $\psi$ when $x \le x_1$, where $x_1$ satisfies $\psi''(x_1) = -1/4$. The value of $x_1$ is computed from the equation $\varphi(-x_1)^{-1} = (1 + \sqrt{2})/2$. Let $y_1 = \psi(x_1)$ and $p_1 = \psi'(x_1)$. They have the following values:

$$x_1 = 1 - \sqrt{4\sqrt{2} - 5} \simeq 0.1895,$$
$$y_1 = -\log\bigl[2(\sqrt{2} - 1)\bigr] \simeq 0.1882,$$
$$p_1 = \frac{\sqrt{4\sqrt{2} - 5}}{2(\sqrt{2} - 1)} \simeq 0.978.$$

After $x_1$, we continue $\chi$ with a quadratic function, until its derivative cancels. Thus, the second derivative of $\chi$ being less than the second derivative of $\bar\psi$ at each point of the positive real line, we are sure that $\chi(x) \le \bar\psi(x)$ for any $x \in \mathbb{R}$. The function $\chi$ built in this way satisfies the equation

$$\chi(x) = \begin{cases} \psi(x), & x \le x_1, \\ y_1 + p_1 (x - x_1) - \frac{1}{8}(x - x_1)^2, & x_1 \le x \le x_1 + 4 p_1, \\ y_1 + 2 p_1^2 < 2.103, & x \ge x_1 + 4 p_1. \end{cases}$$
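The numerical constants of this construction can be recomputed directly. The Python sketch below is a verification aid under assumptions: it takes the narrow influence function to be $\psi(x) = -\log(1 - x + x^2/2)$ on $[0, 1]$ (odd, constant at $\pm\log 2$ outside $[-1, 1]$), evaluates $x_1$, $y_1$, $p_1$, $\sup \chi = y_1 + 2p_1^2$ and the constant $a = 3\exp(\sup\chi)/(4\log 4)$ of equation (7.1), and implements $\chi$ so that the sandwich between $\psi$ and $\bar\psi$ can be checked on a grid. All names are illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)

# x1 solves phi(-x1)^(-1) = (1 + sqrt(2)) / 2, with phi(x) = 1 + x + x^2 / 2.
X1 = 1.0 - math.sqrt(4.0 * SQRT2 - 5.0)
PHI_MINUS_X1 = 1.0 - X1 + 0.5 * X1 * X1
Y1 = -math.log(PHI_MINUS_X1)          # psi(x1)
P1 = (1.0 - X1) / PHI_MINUS_X1        # psi'(x1)
SUP_CHI = Y1 + 2.0 * P1 * P1          # maximum of the quadratic continuation
A_CONST = 3.0 * math.exp(SUP_CHI) / (4.0 * math.log(4.0))


def psi_bar(x):
    """Upper envelope log(1 + x + x^2 / 2)."""
    return math.log(1.0 + x + 0.5 * x * x)


def psi_narrow(x):
    """Narrow influence function (assumed form of equation (2.3))."""
    if x >= 1.0:
        return math.log(2.0)
    if x >= 0.0:
        return -math.log(1.0 - x + 0.5 * x * x)
    if x >= -1.0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(2.0)


def chi(x):
    """psi up to x1, then a parabola with second derivative -1/4, then a constant."""
    if x <= X1:
        return psi_narrow(x)
    if x <= X1 + 4.0 * P1:
        return Y1 + P1 * (x - X1) - (x - X1) ** 2 / 8.0
    return SUP_CHI


if __name__ == "__main__":
    print(X1, Y1, P1, SUP_CHI, A_CONST)
```

Running the script prints values consistent with $x_1 \simeq 0.1895$, $y_1 \simeq 0.1882$, $p_1 \simeq 0.978$, $\sup\chi < 2.103$ and $a < 4.43$.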

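As we have proved, and as can be seen on the next plot, the function $\chi$ is such that

$$\psi(x) \le \chi(x) \le \bar\psi(x), \quad x \in \mathbb{R}.$$

[Figure: plot of $\chi$ and $\psi$.]

Let us now compare $\chi$ with a suitable convex function (in order to apply Jensen's inequality). Let us introduce to this purpose the function $\chi_{x_*}(x) = \chi(x) + \frac{1}{8}(x - x_*)^2$, which is convex for any choice of the parameter $x_* \in \mathbb{R}$.

Let us consider, as in the lemma we have to prove, some function $f \in L_2(\rho)$ and let us choose $x_* = \int \rho(d\beta) f(\beta)$ and put $\int \rho(d\beta)\bigl[f(\beta) - \int \rho f\bigr]^2 = \operatorname{Var}_\rho(f)$. We obtain by Jensen's inequality

$$\psi(x_*) \le \chi(x_*) = \chi_{x_*}(x_*) = \chi_{x_*}\Bigl[\int \rho(d\beta) f(\beta)\Bigr] \le \int \rho(d\beta)\, \chi_{x_*}[f(\beta)] = \int \rho(d\beta)\, \chi[f(\beta)] + \frac{1}{8} \operatorname{Var}_\rho(f).$$

On the other hand, it is clear that

$$\psi(x_*) \le \int \rho(d\beta)\, \chi[f(\beta)] - \inf \chi + \sup \psi = \int \rho(d\beta)\, \chi[f(\beta)] + \log(4).$$

We have proved

LEMMA 7.1 For any posterior distribution $\rho$ and any $f \in L_2(\rho)$,

$$\psi\Bigl[\int \rho(d\beta) f(\beta)\Bigr] \le \int \rho(d\beta)\, \chi[f(\beta)] + \min\Bigl\{\log(4), \frac{1}{8} \operatorname{Var}_\rho(f)\Bigr\}.$$

To end the proof of Lemma 4.4 (page 21), it remains to establish that for any $x \in \mathbb{R}$ and $y \in \mathbb{R}_+$,

$$\chi(x) + \min\Bigl\{\log(4), \frac{y}{8}\Bigr\} \le \log\Bigl(1 + x + \frac{x^2}{2} + \frac{a y}{2}\Bigr),$$

where

$$a = \frac{3 \exp[\sup(\chi)]}{4 \log(4)} = \frac{3 \exp(y_1 + 2 p_1^2)}{4 \log(4)} \le 4.43. \tag{7.1}$$

Indeed $a$ should satisfy

$$a \ge \frac{2}{y}\Bigl\{\exp[\chi(x)] \min\Bigl\{4, \exp\Bigl(\frac{y}{8}\Bigr)\Bigr\} - \Bigl(1 + x + \frac{x^2}{2}\Bigr)\Bigr\}, \quad x \in \mathbb{R},\ y \in \mathbb{R}_+.$$

Since $\exp[\chi(x)] \le 1 + x + \frac{x^2}{2}$, the right-hand side of this inequality is less than

$$\frac{2 \exp[\chi(x)]}{y}\Bigl(\min\Bigl\{4, \exp\Bigl(\frac{y}{8}\Bigr)\Bigr\} - 1\Bigr).$$

As $y \mapsto y^{-1}\bigl[\exp\bigl(\frac{y}{8}\bigr) - 1\bigr]$ is increasing on $\mathbb{R}_+$, this last expression reaches its maximum when $x \in \arg\max \chi$ and $\exp\bigl(\frac{y}{8}\bigr) = 4$, and is then equal to

$$\frac{3 \exp[\sup(\chi)]}{4 \log(4)},$$

which is the value stated for $a$ in equation (7.1) above.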
7.3. Proof of Proposition 5.1 (page 28). Consider the function

$$g(x) = \begin{cases} x - \log(1 + x + x^2/2), & x \ge 0, \\ x + \log(1 - x + x^2/2), & x \le 0. \end{cases}$$

This function could also be defined as $g(x) = x - \psi(x)$, where $\psi$ is the wide version of the influence function defined by equation (2.2), page 5. It is such that

$$g'(x) = \frac{x^2}{2(1 + x + x^2/2)} \le \frac{x^2}{2}, \quad x \ge 0.$$

Therefore

$$0 \le g(x) \le \frac{x^3}{6}, \quad x \ge 0,$$

implying by symmetry that

$$|g(x)| \le \frac{|x|^3}{6}, \quad x \in \mathbb{R}.$$

We can also remark that

$$g'(x) \le \frac{x}{2(1 + \sqrt{2})}, \quad x \ge 0,$$

implying that

$$|g(x)| \le \frac{x^2}{4(1 + \sqrt{2})}, \quad x \in \mathbb{R}.$$

As it is also obvious that $|g(x)| \le |x|$, we get

Lemma 7.2

$$|g(x)| \le \min\Bigl\{|x|,\ \frac{x^2}{4(1 + \sqrt{2})},\ \frac{|x|^3}{6}\Bigr\}.$$
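The three bounds of Lemma 7.2 can be checked numerically on a grid. The following Python sketch is only a sanity check, not a proof; it assumes the wide influence function $\psi(x) = \log(1 + x + x^2/2)$ for $x \ge 0$ and $\psi(x) = -\log(1 - x + x^2/2)$ for $x \le 0$, which is the form implied by the definition of $g$ above, and the function names are illustrative.

```python
import math


def psi_wide(x):
    """Wide influence function, as implied by g(x) = x - psi(x)."""
    if x >= 0.0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(1.0 - x + 0.5 * x * x)


def g(x):
    """Remainder g(x) = x - psi(x) used in the truncation argument."""
    return x - psi_wide(x)


def lemma_7_2_bound(x):
    """min{|x|, x^2 / (4 (1 + sqrt(2))), |x|^3 / 6}."""
    ax = abs(x)
    return min(ax, x * x / (4.0 * (1.0 + math.sqrt(2.0))), ax ** 3 / 6.0)


def check_lemma(grid):
    """Return True when |g(x)| <= bound(x) holds at every grid point."""
    return all(abs(g(x)) <= lemma_7_2_bound(x) + 1e-12 for x in grid)
```

Each of the three terms is the active one on some range: the cubic term near the origin, the quadratic term at intermediate values, and the linear term for large $|x|$.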



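Let us now decompose the empirical mean as

$$M = m + \frac{1}{\alpha n} \sum_{i=1}^n \psi[\alpha(Y_i - m)] + \frac{1}{\alpha n} \sum_{i=1}^n G_i,$$

where $G_i = g[\alpha(Y_i - m)]$ and $\psi$ is the wide influence function of equation (2.2), page 5. As we have already seen in Proposition 2.2 (page 7), with probability at least $1 - 2\epsilon_1$,

$$\Bigl|\frac{1}{n\alpha} \sum_{i=1}^n \psi[\alpha(Y_i - m)]\Bigr| \le \frac{\alpha v}{2} + \frac{\log(\epsilon_1^{-1})}{n\alpha}.$$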
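On the other hand, with probability at least $1 - \epsilon_2$,

$$\Bigl|\frac{1}{n\alpha} \sum_{i=1}^n G_i\Bigr| \le \frac{1}{n\alpha} \sum_{i=1}^n |G_i| \le \frac{E(|G|)}{\alpha} + \frac{1}{n \alpha\, \epsilon_2^{1/4}}\, E\Bigl\{\Bigl[\sum_{i=1}^n \bigl(|G_i| - E(|G|)\bigr)\Bigr]^4\Bigr\}^{1/4}.$$

Let us put $H_i = |G_i|$ and let us compute

$$\frac{1}{n} E\Bigl\{\Bigl[\sum_{i=1}^n \bigl(H_i - E(H)\bigr)\Bigr]^4\Bigr\} = E\bigl\{[H - E(H)]^4\bigr\} + 3(n-1) E\bigl\{[H - E(H)]^2\bigr\}^2$$
$$= E(H^4) - 4E(H^3)E(H) + 6E(H^2)E(H)^2 - 3E(H)^4 + 3(n-1)\bigl[E(H^2)^2 - 2E(H^2)E(H)^2 + E(H)^4\bigr]$$
$$\le E(H^4) + 2E(H^2)E(H)^2 - 3E(H)^4 + 3(n-1)\bigl[E(H^2)^2 - 2E(H^2)E(H)^2 + E(H)^4\bigr]$$
$$\le E(H^4) + 3(n-1)E(H^2)^2 - (3n-2)E(H)^4$$
$$\le \kappa \alpha^4 v^2 + \frac{3(n-1)\kappa^2 \alpha^8 v^4}{[4(1+\sqrt{2})]^4} - (3n-2)E(H)^4.$$

Moreover

$$E(H) \le \frac{\alpha^3}{6} E(|Y - m|^3) \le \frac{\alpha^3}{6} \sqrt{\kappa}\, v^{3/2}.$$

Thus with probability at least $1 - 2\epsilon_1 - \epsilon_2$,

$$|M - m| \le \frac{\alpha v}{2} + \frac{\log(\epsilon_1^{-1})}{n\alpha} + \sup_{x \in (0,\, \sqrt{\kappa} v^{3/2} \alpha^3/6)} \Bigl\{\frac{x}{\alpha} + n^{-3/4} \alpha^{-1} \epsilon_2^{-1/4} \Bigl[\kappa \alpha^4 v^2 + \frac{3(n-1)\kappa^2 \alpha^8 v^4}{[4(1+\sqrt{2})]^4} - (3n-2)x^4\Bigr]^{1/4}\Bigr\}$$
$$\le \frac{\alpha v}{2} + \frac{\log(\epsilon_1^{-1})}{n\alpha} + \frac{\sqrt{\kappa}\, v^{3/2} \alpha^2}{6} + \frac{\sqrt{v}}{n^{3/4}} \Bigl(\frac{\kappa}{\epsilon_2}\Bigr)^{1/4} \Bigl[1 + \frac{3(n-1)\kappa \alpha^4 v^2}{[4(1+\sqrt{2})]^4}\Bigr]^{1/4}.$$

Let us take

$$\alpha = \sqrt{\frac{2 \log(\epsilon_1^{-1})}{n v}}$$

and let us put $\epsilon_1 = \lambda \epsilon$ and $\epsilon_2 = (1 - \lambda) 2\epsilon$.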

The bound can either be optimized in A or we can for simplicity choose A to 
balance the following factors 



2v lo 



n 



1/4 . 

n 3 / 4 V2ey 4' 
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This leads to consider the value 



1 _ /_ . / K \ ( 2ne 



A = nrfn<-,2^21ot V2 ft 



1/4 



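As stated in Proposition 5.1 (page 28), with probability at least $1 - 2\epsilon$,

$$|M - m| \le \sqrt{\frac{2 v \log(\lambda^{-1}\epsilon^{-1})}{n}} + \frac{\sqrt{\kappa v}\, \log(\lambda^{-1}\epsilon^{-1})}{3n} + \sqrt{\frac{v}{n}} \Bigl(\frac{\kappa}{2(1 - \lambda) n \epsilon}\Bigr)^{1/4} \Bigl[1 + \frac{3(n-1)\kappa \log(\lambda^{-1}\epsilon^{-1})^2}{4^3 (1 + \sqrt{2})^4 n^2}\Bigr]^{1/4} \underset{\epsilon \to 0}{\sim} \sqrt{\frac{v}{n}} \Bigl(\frac{\kappa}{2 n \epsilon}\Bigr)^{1/4}.$$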

e^o V n \2neJ 

Another bound can be obtained by applying Chebyshev's inequality directly to the fourth moment of the empirical mean, which however does not reach the right speed when $\epsilon$ is small and $n$ large.

PROPOSITION 7.3 For any probability distribution whose kurtosis is not greater than $\kappa$, the empirical mean $M$ is such that with probability at least $1 - 2\epsilon$,

$$|M - m| \le \Bigl(\frac{3(n-1) + \kappa}{2 n \epsilon}\Bigr)^{1/4} \sqrt{\frac{v}{n}}.$$

PROOF. Let us assume to simplify notations and without loss of generality that $E(Y) = 0$. Then

$$E(M^4) = \frac{1}{n^4}\Bigl[\sum_{i=1}^n E(Y_i^4) + 6 \sum_{i<j} E(Y_i^2) E(Y_j^2)\Bigr] \le \frac{[3(n-1) + \kappa] v^2}{n^3}.$$

It implies that

$$P(|M - m| \ge \eta) \le \frac{E(M^4)}{\eta^4} \le \frac{[3(n-1) + \kappa] v^2}{n^3 \eta^4},$$

and the result is proved by considering $2\epsilon = \frac{[3(n-1) + \kappa] v^2}{n^3 \eta^4}$. □

In our comparisons with new estimators, we took the minimum over the three bounds given by Proposition 5.1 (page 28), Proposition 7.3 (page 37) and equation (5.1), page 27.
7.4. Proof of Proposition 6.1 (page 29). Let us consider the distributions $P_1$ and $P_2$ of the sample $(Y_i)_{i=1}^n$ obtained when the marginal distributions are respectively the Gaussian measure with variance $v$ and mean $m_1 = -\eta$ and the Gaussian measure with variance $v$ and mean $m_2 = \eta$. We see that, whatever the estimator $\hat\theta$,

$$P_1(\hat\theta \ge m_1 + \eta) + P_2(\hat\theta \le m_2 - \eta) = P_1(\hat\theta \ge 0) + P_2(\hat\theta \le 0)$$
$$\ge (P_1 \wedge P_2)(\hat\theta \ge 0) + (P_1 \wedge P_2)(\hat\theta \le 0) \ge |P_1 \wedge P_2|,$$

where $P_1 \wedge P_2$ is the measure whose density with respect to the Lebesgue measure (or equivalently with respect to any dominating measure, such as $P_1 + P_2$) is the minimum of the densities of $P_1$ and $P_2$ and whose total variation is $|P_1 \wedge P_2|$.

Now, using the fact that the empirical mean is a sufficient statistic of the Gaussian shift model, it is easy to realize that

$$|P_1 \wedge P_2| = P_1(M \ge m_1 + \eta) + P_2(M \le m_2 - \eta),$$

which obviously proves the proposition.



7.5. Proof of Proposition 6.2 (page 30). Let us consider the distribution with support $\{-n\eta, 0, n\eta\}$ defined by

$$P(\{n\eta\}) = P(\{-n\eta\}) = \frac{1 - P(\{0\})}{2} = \frac{v}{2 n^2 \eta^2}.$$

It satisfies $E(Y) = 0$, $E(Y^2) = v$ and

$$P(M \ge \eta) = P(M \le -\eta) \ge P(M = \eta) \ge \frac{v}{2 n \eta^2}\Bigl(1 - \frac{v}{n^2 \eta^2}\Bigr)^{n-1}.$$
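The three-point construction is simple enough to evaluate numerically. The Python sketch below computes the atom weight $q = v/(2n^2\eta^2)$, the lower bound $n q (1 - 2q)^{n-1}$ (the probability that exactly one observation equals $n\eta$ while all the others are zero), and a deviation level for a target $\epsilon$. The formula used in `eta_for_level` is an assumption reconstructed from the statement following Proposition 6.2; the function names are illustrative.

```python
import math


def atom_weight(n, v, eta):
    """Weight q of each of the atoms -n*eta and +n*eta, so that E(Y^2) = v."""
    q = v / (2.0 * n ** 2 * eta ** 2)
    if q > 0.5:
        raise ValueError("eta is too small: the atom weights would exceed 1/2")
    return q


def lower_prob(n, v, eta):
    """Lower bound n q (1 - 2 q)^(n - 1) on P(M >= eta) for the three-point law."""
    q = atom_weight(n, v, eta)
    return n * q * (1.0 - 2.0 * q) ** (n - 1)


def eta_for_level(n, v, eps):
    """Assumed deviation level sqrt(v / (2 n eps)) (1 - 2 e eps / n)^((n - 1) / 2)."""
    return math.sqrt(v / (2.0 * n * eps)) * (1.0 - 2.0 * math.e * eps / n) ** ((n - 1) / 2.0)
```

For $n = 100$, $v = 1$ and $\epsilon = 0.01$, `lower_prob(n, v, eta_for_level(n, v, eps))` is slightly above $\epsilon$, so that each tail of $M$ carries probability at least $\epsilon$ at this deviation level.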



7.6. Proof of Proposition 6.3 (page 31). Let us consider for $Y$ the following distribution, with support $\{-n\eta, -\xi, \xi, n\eta\}$, where $\xi$ and $\eta$ are two positive real parameters, to be adjusted to obtain the desired variance and kurtosis:

$$P(Y = -n\eta) = P(Y = n\eta) = q,$$
$$P(Y = -\xi) = P(Y = \xi) = \frac{1}{2} - q.$$

In this case

$$m = 0,$$
$$v = (1 - 2q)\xi^2 + 2 q n^2 \eta^2,$$
$$\kappa v^2 = (1 - 2q)\xi^4 + 2 q n^4 \eta^4.$$

Thus $\kappa = f_q(n\eta/\xi)$, where

$$f_q(x) = \frac{1 - 2q + 2 q x^4}{(1 - 2q + 2 q x^2)^2}, \quad x \ge 1.$$

It is easy to see that $f_q$ is an increasing one to one mapping of $(1, +\infty)$ into itself.
We obtain 



Consequently 



and r] 



2gy n — ^ — \ 2g y n 



l-2q + 2q[f-\K 



< Vv 



n 



1/4 



On the other hand, 

P(M > »7 - 7) > it F ( Yi = n ^ \ E ^ ^ G ^ *) 

= riP(Yi = nr])(l - 2q) n - 1 



Let us remark that 



P (^X> > 7 I ^ e {-£,£},2 < j < 

< inf cosh(A£/W)™ 1 exp(— A7) < inf exp^-^ A7^ 



ri7 2 \ / ri7 2 

( ' xl>i -2eJ- exp l-^r 



Also, by symmetry 



P -V^>7 



e{-e,0>2<i<n < 



Thus, as m = 0, 



777 



P(|M — m | > r] — 7) > 2ng(l — 2g) n 1 max! -, 1 — exp 



Let us put 



X = mm<; -,exp 



and let us assume that e < — . Let us put 

16 



4e 



nil -x)\ n { 1 ~ X) 



717 

'^7 



-(n-l) 



< 



2e 



n(l - x)' 



the last inequality being a consequence of the convexity of 2 i-)- (1— a;) n 1 > 
1 — (n — l)x > 1 — nx, < x < 1. Let us remark then that 



P(|M-m| > 77-7) > 



2e(l - 2g) 



n-l 



1 



4e 



n(l - x) 



n-l 



> 2e. 



Thus, with probability at least 2e, 

'(^-l + 2e/n)(l- X ) nl/4 



M -m > 



2ne 



1 \ (n-l)/4 

4e \ /f 



n(l - x) 



2\og(x *)v 



> sup 

0<X<l/2 



(k- l + 2e/n)(l-x-4e) 
2^ 



7?. 
1 1/4 



r / 21og( X - 1 )l(x< 1/2)7; 
77 v n 



In order to obtain Proposition 6.3 (page 31 ), it is enough to restrict optimization 
with respect to x to the two values 

H 1/4 H 1 

X = — — and x = -■ 



8. Generalization to non-identically distributed samples 



The assumption that the sample is identically distributed can be dropped in Proposition 2.4 (page 8). Indeed, assuming only that the random variables $(Y_i)_{i=1}^n$ are independent, meaning that their joint distribution is of the product form $\bigotimes_{i=1}^n P_i$, we can still write, for $W_i = \pm\alpha(Y_i - \theta)$,

$$E\Bigl\{\exp\Bigl[\sum_{i=1}^n \log\Bigl(1 + W_i + \frac{W_i^2}{2}\Bigr)\Bigr]\Bigr\} = \exp\Bigl\{\sum_{i=1}^n \log\Bigl[1 + E(W_i) + \frac{E(W_i^2)}{2}\Bigr]\Bigr\}$$
$$\le \exp\Bigl\{n \log\Bigl[1 + \frac{1}{n} \sum_{i=1}^n E(W_i) + \frac{1}{2n} \sum_{i=1}^n E(W_i^2)\Bigr]\Bigr\}.$$

Starting from these exponential inequalities, we can reach the same conclusions as in Proposition 2.4 (page 8), as long as we set

$$m = \frac{1}{n} \sum_{i=1}^n E(Y_i) \quad \text{and} \quad v = \frac{1}{n} \sum_{i=1}^n E\bigl[(Y_i - m)^2\bigr].$$

We see that the mean marginal sample distribution $\frac{1}{n} \sum_{i=1}^n P_i$ is playing here the same role as the marginal sample distribution in the i.i.d. case.



9. Experiments 



9.1. Mean estimators. Theoretical bounds can explore high confidence levels
better than experiments, and also have the advantage of holding true in the worst
case. They have led us to introduce new M-estimators, and in particular the one
described in Proposition 4.5 (page 23). Nonetheless, they may be expected to be
rather pessimistic and are clearly insufficient to explore the moderate deviations
of the proposed estimators. In particular it would be interesting to know whether
the improvement in high confidence deviations has been obtained as a trade-off
between large and moderate deviations (by which we mean a trade-off between
the left part and the tail of the quantile function of $|\hat\theta - m|$).

This is a good reason to launch into some experiments. We are going to test
sample distributions of the form

$$\sum_{i=1}^d p_i \, \mathcal{N}(m_i, \sigma_i^2),$$

where $d \in \{1, 2, 3\}$, $(p_i)_{i=1,\dots,d}$ is a vector of probabilities and
$\mathcal{N}(m, \sigma^2)$ is as usual the Gaussian measure with mean $m$ and variance
$\sigma^2$. To visualize results, we have chosen to plot the quantile function of the
deviations from the true parameter, that is the quantile function of $|\hat\theta - m|$,
where $\hat\theta$ is one of the mean estimators studied in this paper.
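
The protocol just described can be sketched in a few lines of Python. This is an
illustration only, not the code used for the experiments reported here: the function
names, the seed, the reduced number of experiments and the choice of the empirical
mean and median as example estimators are all assumptions made for the sketch.

```python
import random
import statistics

def sample_mixture(rng, n, probs, means, sds):
    """Draw n points from the Gaussian mixture sum_i probs[i] * N(means[i], sds[i]**2)."""
    comps = rng.choices(range(len(probs)), weights=probs, k=n)
    return [rng.gauss(means[c], sds[c]) for c in comps]

def deviation_quantiles(estimator, true_mean, n, n_experiments, probs, means, sds, seed=0):
    """Sorted absolute deviations |estimate - true_mean| over repeated experiments.

    The k-th sorted value (0-based) approximates the quantile of order
    (k + 1) / n_experiments of the deviations.
    """
    rng = random.Random(seed)
    devs = []
    for _ in range(n_experiments):
        sample = sample_mixture(rng, n, probs, means, sds)
        devs.append(abs(estimator(sample) - true_mean))
    return sorted(devs)

# Illustrative two-component centered mixture (hypothetical parameters for the sketch).
probs, means, sds = [0.99, 0.01], [0.0, 0.0], [1.0, 30.0]
q_mean = deviation_quantiles(statistics.fmean, 0.0, 100, 200, probs, means, sds)
q_median = deviation_quantiles(statistics.median, 0.0, 100, 200, probs, means, sds)
print("90% deviation quantile, empirical mean  :", q_mean[179])
print("90% deviation quantile, empirical median:", q_median[179])
```

Plotting the sorted deviations against the levels (k + 1) / n_experiments gives an
empirical quantile function of the kind shown in the figures of this section.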

Let us start with an asymmetric distribution with a so to speak intermittent
high variance component. Let us take accordingly

$p_1 = 0.7$, $m_1 = 2$, $\sigma_1 = 1$,

$p_2 = 0.2$, $m_2 = -2$, $\sigma_2 = 1$,

$p_3 = 0.1$, $m_3 = 0$, $\sigma_3 = 30$.

In this case, $m = 1$, $\kappa = 27.86$ and $v = 93.5$, so that, when the sample size $n$
is in the range $100 \leq n \leq 1000$, the variance estimates we are proposing in this
paper are not proved to be valid. For this reason we will challenge the empirical
mean with the two following estimates: the estimate $\hat\theta_\alpha$ of Proposition 2.4
(page 8) (where $\alpha$ is chosen using the true value of $v$) and a naive plug-in
estimate, $\hat\theta_{\hat\alpha}$, where $\hat\alpha$ is set as was $\alpha$, replacing the
true variance $v$ with its unbiased estimate $V$ given by equation (4.1) (page 14). The
parameter $\varepsilon$ is set to $\varepsilon = 0.05$ for both
estimators, targeting the probability level $1 - 2\varepsilon = 0.9$.
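
The values of the mean, variance and kurtosis quoted for this mixture can be checked
from the closed-form moments of a Gaussian mixture. The sketch below assumes that the
kurtosis is the fourth central moment divided by the squared variance (so that it
equals 3 for a Gaussian measure) and that the third parameter of each component is a
standard deviation; the function name is hypothetical.

```python
def mixture_moments(probs, means, sds):
    """Mean, variance and kurtosis of the mixture sum_i probs[i] * N(means[i], sds[i]**2)."""
    m = sum(p * mu for p, mu in zip(probs, means))
    v = 0.0
    fourth = 0.0
    for p, mu, sd in zip(probs, means, sds):
        d = mu - m  # offset of the component mean from the mixture mean
        v += p * (sd**2 + d**2)
        # Fourth central moment of N(mu, sd^2) taken around m.
        fourth += p * (d**4 + 6 * d**2 * sd**2 + 3 * sd**4)
    return m, v, fourth / v**2

m, v, kappa = mixture_moments([0.7, 0.2, 0.1], [2.0, -2.0, 0.0], [1.0, 1.0, 30.0])
print(round(m, 2), round(v, 2), round(kappa, 2))  # 1.0 93.5 27.86
```

The same function can be applied to the other mixtures of this section to check their
stated moments.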

We will also plot the sample median estimator, in this case where the distribution
median is different from its mean, to show that robust location estimators for
symmetric distributions do not apply here.

When the sample size is n = 100, we obtain the following results (computed
from 1000 experiments).

[Figure: quantile functions of the deviations; sample size n = 100, number of
experiments: 1000. Curves: $|M - m|$; $|\hat\theta_\alpha - m|$, known variance;
$|\hat\theta_{\hat\alpha} - m|$, unknown variance; deviations of the empirical median
from m. Horizontal axis: probability levels.]

In this first example, the new M-estimators have uniformly better quantiles than
the empirical mean, at any probability level. Moreover the variance can be harm- 
lessly estimated from the data when it is unknown. Thus, in this case, the empiri- 
cal mean is outperformed from any conceivable point of view. 



Let us now increase the sample size to n = 1000. 



[Figure: quantile functions of the deviations; sample size n = 1000, number of
experiments: 1000. Legible curve label: $|M - m|$. Horizontal axis: probability levels.]

As should be expected, the values of the three estimators get close for this larger 
sample size (whereas it becomes more obvious that the empirical median is esti- 
mating something different from the mean). 



The empirical mean can still be challenged for this larger sample size, but for
a different sample distribution. To illustrate this, let us consider a situation as
simple as the mixture of two centered Gaussian measures. Let $d = 2$ and

$p_1 = 0.99$, $m_1 = 0$, $\sigma_1 = 1$,

$p_2 = 0.01$, $m_2 = 0$, $\sigma_2 = 30$.

Here $\kappa = 243.5$, $v = 9.99$ and $m = 0$. We take $\varepsilon = 0.005$, targeting the
probability level $1 - 2\varepsilon = 1 - 10/n = 0.99$.

[Figure: quantile functions of the deviations; sample size n = 1000, number of
experiments: 1000.]

Let us show some heavily asymmetric situation where the left-hand side of the
quantile function of the new estimators does not improve on the empirical mean.
In what follows $\kappa = 33.4$, $v = 72.25$, $m = -1.3$, and the mixture parameters are

$p_1 = 0.94$, $m_1 = 0$, $\sigma_1 = 1$,

$p_2 = 0.01$, $m_2 = 20$, $\sigma_2 = 20$,

$p_3 = 0.05$, $m_3 = -30$, $\sigma_3 = 20$.

We plot below two estimators using the value of the variance, optimized for the
confidence level $1 - 2\varepsilon$ with $\varepsilon = 0.05$ and $\varepsilon = 0.0005$
respectively (the estimators with unknown variance show the same behaviour).

Here, choosing a moderate value of the estimator parameter $\varepsilon$ is required to
improve uniformly on the empirical mean performance, whereas higher values of $\varepsilon$
produce a trade-off between low and high probability levels. Whether this remains
true in general would require to be confirmed by more extensive experiments.

[Figure: quantile functions of the deviations; sample size n = 100, number of
experiments: 1000.]

Let us end this section with a Gaussian sample.

[Figure: quantile functions of the deviations; sample size n = 1000, number of
experiments: 1000. Curves: $|M - m|$; $|\hat\theta_\alpha - m|$, known variance;
$|\hat\theta_{\hat\alpha} - m|$, unknown variance; deviations of the empirical median
from m. Horizontal axis: probability levels, from 0.1 to 1.0.]

When the sample is Gaussian, as could be expected, our new M-estimators
coincide with the empirical mean. What we obtained for a sample size n = 1000
could also be observed for other sample sizes. The deviations of the empirical
median are higher in the Gaussian case, as proved in Proposition 6.1
(stating that the deviations of the empirical mean of a Gaussian sample are
optimal).



9.2. Variance estimators. Let us now test some variance estimates. This
is an example where the usual unbiased estimate $V$ defined by equation (4.1)
(page 14) shows its weakness. To demonstrate things on simple sample distributions,
we choose again a mixture of two Gaussian measures

$p_1 = 0.995$,

$p_2 = 0.005$,

Here $\kappa = 10.357$, $v = 1.125$ and $m = 0.005$. To be in a situation where the
variance estimates of Proposition 4.1 (page 17) work at high confidence levels,
we choose a sample size n = 2000, and use in the estimator the parameters
$\kappa_{\max} = 6 \times n/1000 = 12$, $p = 2$ and $\varepsilon = 0.0025$ (targeting the
probability level $1 - 2\varepsilon = 1 - 10/n = 0.995$).

[Figure: sample size n = 2000, number of experiments: 1000.]

So, for the variance as well as for the mean, there are simple situations in which
our new estimators perform better in practice than the more traditional ones. 

9.3. Computation details. To compute the estimators in these experiments,
we used the following two iterative schemes (performing two iterations was enough
for all the examples shown).

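$\theta_0 = M, \qquad \theta_{k+1} = r(\theta_k) + \theta_k,$

$\beta_0 = \dfrac{\delta - y}{V}, \qquad \beta_{k+1} = \dfrac{(\delta - y)\,\beta_k}{Q(\beta_k) + \delta}.$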

They are based on two principles: they have the desired fixed point and their right-hand
side would be independent respectively of $\theta_k$ and $\beta_k$ if $\psi$ was replaced with
the identity. The fact that $\psi$ is close to the identity explains why the convergence
is fast and only a few iterations are required.
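
The scheme for the mean can be sketched as follows. Several ingredients below are
assumptions made for illustration rather than definitions taken from this section: the
influence function psi (taken here to be log(1 + x + x^2/2) for x >= 0, extended as an
odd function), the form r(theta) = (1 / (n alpha)) sum_i psi(alpha (Y_i - theta)), which
reduces to M - theta when psi is the identity, and a simplified tuning of alpha that is
not necessarily the one of Proposition 2.4.

```python
import math
import random

def psi(x):
    """Assumed influence function: log(1 + x + x^2/2) for x >= 0, odd extension otherwise."""
    if x >= 0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(1.0 - x + 0.5 * x * x)

def r(theta, sample, alpha):
    """Assumed form r(theta) = (1 / (n * alpha)) * sum_i psi(alpha * (Y_i - theta))."""
    return sum(psi(alpha * (y - theta)) for y in sample) / (len(sample) * alpha)

def m_estimate(sample, alpha, n_iter=2):
    """Iterate theta_{k+1} = theta_k + r(theta_k), starting from the empirical mean."""
    theta = sum(sample) / len(sample)
    for _ in range(n_iter):
        theta += r(theta, sample, alpha)
    return theta

rng = random.Random(1)
# Illustrative sample: N(0, 1) contaminated by N(0, 30^2) with probability 0.01.
sample = [rng.gauss(0.0, 30.0 if rng.random() < 0.01 else 1.0) for _ in range(1000)]
v, eps = 9.99, 0.005
# Simplified tuning of alpha (an assumption of this sketch).
alpha = math.sqrt(2.0 * math.log(1.0 / eps) / (len(sample) * v))
theta2 = m_estimate(sample, alpha, n_iter=2)
theta50 = m_estimate(sample, alpha, n_iter=50)
print("empirical mean:", sum(sample) / len(sample))
print("after 2 iterations:", theta2, " after 50 iterations:", theta50)
print("residual r(theta) after 50 iterations:", r(theta50, sample, alpha))
```

Since the derivative of psi lies between 0 and 1, each step is a contraction towards
the root of r, and the contraction is strong when most of the values
alpha (Y_i - theta) are small, which is the regime where psi is close to the identity.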

These numerical schemes involve only a reasonable amount of computations, 
opening the possibility to use the new estimators in improved filtering algorithms 
in signal and image processing (a subject for future research that will not be 
pushed further in this paper). 



10. Conclusion 

Theoretical results show that, for some sample distributions, the deviations 
of the empirical mean at confidence levels higher than 90% are larger than the 
deviations of some well chosen M-estimator. Moreover, in our experiments, based 
on non-Gaussian sample distributions, the deviation quantile function of this M- 
estimator is uniformly below the quantile function of the empirical mean. The 
improvement of the confidence interval at level 90% can be more than 25%. 

Using Lepski's adapting approach offers a response with proved properties to
the case when the variance is unknown. For sample sizes starting from 1000, an
alternative is to use an M-estimator of the variance depending on some assumption
on the value of the kurtosis. However, it seems that the variance can in practice
be estimated by the usual unbiased estimator $V$, defined by equation (4.1) (page
14), and plugged in the estimator of Proposition 2.4 (page 8), although there is no
mathematical warrant for this simplified scheme.